Concentration inequalities and non-asymptotic confidence intervals
We discuss in this section a few basic concentration inequalities (Chernoff bounds, Hoeffding bounds and McDiarmid bounds) and give some applications. Concentration inequalities are very important tools for providing performance guarantees for Monte-Carlo methods, but also in a variety of other contexts in Uncertainty Quantification and Machine Learning. For further reading we recommend the excellent book (Boucheron, Lugosi, and Massart 2013).
Chernoff bounds
We start with a basic but fundamental bound.
Theorem 4.1 (Markov and Chernoff inequalities)
Markov inequality: Suppose Y\ge 0 is a non-negative random variable and a>0 a positive number. Then
P(Y \ge a) \le \frac{E[Y]}{a}\,.
\tag{4.1}
Chernoff inequality: Suppose X is a random variable and a \in \mathbb{R}. Then
\begin{aligned}
& P(X \ge a) \le \inf_{t \ge 0} \frac{E[e^{tX}]}{e^{ta}} \\
& P(X \le a) \le \inf_{t \le 0} \frac{E[e^{tX}]}{e^{ta}}
\end{aligned}
\tag{4.2}
Proof. If Y\ge 0 then \displaystyle Y=Y 1_{\{Y\ge a\}} + Y1_{\{Y< a\}} \ge Y1_{\{Y\ge a\}} \ge a 1_{\{Y\ge a\}} and thus \displaystyle E[Y] \ge a P(Y\ge a) which is Equation 4.1.
Using Markov's inequality we have, for t \ge 0,
P(X \ge a) = P\left( e^{tX} \ge e^{ta} \right) \le \frac{E[e^{tX}]}{e^{ta}} \,.
Optimizing over t gives the first inequality in Equation 4.2. The second inequality is obtained by replacing X by -X.
Examples of Chernoff bounds
- If X is normal with mean \mu and variance \sigma^2, then M(t)=e^{\mu t + \frac{\sigma^2 t^2}{2}} and, for a = \mu + \varepsilon, the Chernoff bound gives
P(X \ge \mu + \varepsilon) \le \inf_{t \ge 0} \frac{e^{\mu t + \frac{\sigma^2 t^2}{2}} }{e^{t(\mu + \varepsilon)}}
= e^{ -\sup_{t\ge 0}\left\{ t\varepsilon - \frac{\sigma^2 t^2}{2}\right\} } = e^{-\frac{\varepsilon^2}{2\sigma^2}}
This is sharp: one can show (see exercises) that
\left(\frac{\sigma}{\varepsilon}-\frac{\sigma^3}{\varepsilon^3} \right)\frac{1}{\sqrt{2\pi}} e^{-\frac{\varepsilon^2}{2\sigma^2}} \le P(X \ge \mu + \varepsilon) \le \frac{\sigma}{\varepsilon}\frac{1}{\sqrt{2\pi}} e^{-\frac{\varepsilon^2}{2\sigma^2}}
- Consider X_n binomial with parameters (n,p); then M(t)=\left(pe^{t}+(1-p)\right)^n. We will use that
\sup_{t} \left\{ tz - \ln(pe^t + (1-p)) \right\} =
\left\{
\begin{array}{cl}
+ \infty & \textrm{ if } z<0 \textrm{ or } z>1 \\
z \ln \frac{z}{p} + (1-z) \ln \frac{1-z}{1-p} := H(z||p) & \textrm{ if } 0 \le z \le 1
\end{array}
\right.
H(z||p)=z \ln \frac{z}{p} + (1-z) \ln \frac{1-z}{1-p} is the KL-divergence (or relative entropy) between Bernoulli distributions with parameters z and p respectively.
The Chernoff bound gives
P\left(X_n \ge n(p+\varepsilon)\right) \le \inf_{t >0} \frac{(pe^t + (1-p))^n}{e^{tn(p+\varepsilon)}}= e^{-n \sup_{t\ge 0}\left\{ t(p+\varepsilon)-\ln(pe^t+(1-p)) \right\}} = e^{-n H( p+\varepsilon||p)}
and similarly P\left(X_n \le n(p-\varepsilon)\right) \le e^{-n H( p-\varepsilon||p)}
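As a quick numerical sanity check of the two examples above, the following short Python snippet (an illustration, not part of the original notes; the parameter values are arbitrary) compares the exact tails with the corresponding Chernoff bounds.

```python
# Sanity check of the Chernoff bounds above (illustrative parameters only).
import math

# Gaussian: exact tail P(X >= mu + eps) versus exp(-eps^2 / (2 sigma^2))
mu, sigma = 0.0, 1.0
for eps in [1.0, 2.0, 3.0]:
    exact = 0.5 * math.erfc(eps / (sigma * math.sqrt(2)))
    bound = math.exp(-eps**2 / (2 * sigma**2))
    print(f"Gaussian eps={eps}:  exact={exact:.3e}  Chernoff={bound:.3e}")

# Binomial(n, p): exact tail P(X_n >= n(p+eps)) versus exp(-n H(p+eps || p))
def H(z, p):
    """KL divergence between Bernoulli(z) and Bernoulli(p)."""
    return z * math.log(z / p) + (1 - z) * math.log((1 - z) / (1 - p))

n, p, eps = 100, 0.3, 0.1
k0 = 40   # smallest integer k with k >= n (p + eps)
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k0, n + 1))
bound = math.exp(-n * H(p + eps, p))
print(f"Binomial n={n}, p={p}, eps={eps}:  exact={exact:.3e}  Chernoff={bound:.3e}")
```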
Hoeffding’s bound
- Chernoff bounds are very sharp but require knowledge of the mgf M_X(t).
- One of the main ideas behind concentration inequalities: given X, bound M_X(t) \le M(t) by the mgf of a random variable Y which you know explicitly. Mostly here we take Y Gaussian, but one also uses other distributions: Bernoulli, Poisson, Gamma, etc.
- The following elementary bound will be used repeatedly.
Lemma 4.1 (Hoeffding’s bound) Suppose a \le X \le b with probability 1. Then
Bound on the variance: \displaystyle \textrm{Var}(X) \le \frac{(b-a)^2}{4}
Bound on the mgf: for all t \in \mathbb{R}, \displaystyle E\left[ e^{tX} \right] \le e^{t E[X]} e^{\frac{t^2 (b-a)^2}{8}}
Proof. For the bound on the variance, a \le X \le b implies that \displaystyle -\frac{b-a}{2} \le X - \frac{a+b}{2} \le \frac{b-a}{2} and therefore
\textrm{Var}(X) = \textrm{Var}\left(X - \frac{a+b}{2}\right) \le E\left[\left(X - \frac{a+b}{2}\right)^2\right] \le \frac{(b-a)^2}{4}\,.
Since X is bounded, the moment generating function M(t) =E[e^{tX}] exists for any t \in \mathbb{R}. To bound M(t) let us consider instead its logarithm u(t)=\ln M(t). We have
\begin{aligned}
u'(t) &= \frac{M'(t)}{M(t)} &=& E\left[ X \frac{e^{tX}}{E[e^{tX}]} \right] \\
u''(t) &= \frac{M''(t)}{M(t)}- \left(\frac{M'(t)}{M(t)}\right)^2 &=& E\left[ X^2 \frac{e^{tX}}{E[e^{tX}]} \right] - E\left[ X \frac{e^{tX}}{E[e^{tX}]} \right]^{2}
\end{aligned}
We recognize u''(t) as the variance under the tilted measure with tilted density \frac{e^{tX}}{E[e^{tX}]} and thus by part 1. (applied to the tilted measure) we have u''(t) \le \frac{(b-a)^2}{4}.
Using the Taylor expansion with remainder we have, for some \xi between 0 and t
\ln M(t)= u(t) = u(0) + u'(0)t + u''(\xi) \frac{t^2}{2} \le t E[X] + \frac{t^2(b-a)^2}{8}\,.
This concludes the proof.
Remark: The bound on the variance in 1. is optimal. Indeed, taking without loss of generality a=0 and b=1, the variance is bounded by 1/4 and this is realized by taking X to be a Bernoulli with p=\frac12. This bound says that the RV with the largest variance is the one where the mass is concentrated at the endpoints.
The bound in 2. is optimal only in the sense that it is the best quadratic bound on u(t). For example for a Bernoulli with p=\frac12 (so a=0 and b=1) we have u(t)=\ln M(t)=\ln \left(\frac{1}{2}e^{t} + \frac{1}{2}\right)= \frac{t}{2} + \ln \cosh\left(\frac{t}{2}\right) which is much smaller than \frac{t}{2}+\frac{t^2}{8} for large t. There is sometimes room for better bounds (see e.g. (Boucheron, Lugosi, and Massart 2013)) but using a Gaussian bound is computationally convenient.
Hoeffding’s theorem
Theorem 4.2 (Hoeffding’s Theorem) Suppose X_1, \cdots, X_n are independent random variables such that a_i \le X_i \le b_i (almost surely). Then for any \varepsilon>0
\begin{align}
P \left( X_1 + \cdots + X_n - E[X_1 + \cdots + X_n] \ge \varepsilon \right)
&\le e^{- \frac{2 \varepsilon^2}{\sum_{i=1}^n(b_i-a_i)^2}}\\
P \left( X_1 + \cdots + X_n - E[X_1 + \cdots + X_n] \le -\varepsilon \right)
&\le e^{- \frac{2 \varepsilon^2}{ \sum_{i=1}^n(b_i-a_i)^2}}
\end{align}
\tag{4.3}
Proof. Using independence and Hoeffding’s bound (Lemma 4.1) we have
E\left[e^{t (X_1 + \cdots + X_n - E[X_1 + \cdots + X_n])}\right] =\prod_{i=1}^n E\left[e^{t(X_i-E[X_i])}\right] \le \prod_{i=1}^n e^{\frac{t^2 (b_i-a_i)^2}{8}} = e^{ \frac{t^2 \sum_{i}(b_i-a_i)^2}{8}}
and using the Chernoff bound (for a Gaussian RV with variance \displaystyle \sum_{i}\frac{(b_i-a_i)^2}{4}) gives the first bound in Equation 4.3.
Corollary 4.1 (non-asymptotic confidence interval) Suppose X_1, \cdots, X_N are independent random variables such that a \le X_i \le b (almost surely) and \mu=E[X_i]. Then, with S_N = X_1 + \cdots + X_N,
P\left( \mu \in \left[ \frac{S_N}{N}-\varepsilon, \frac{S_N}{N} +\varepsilon \right] \right) \ge 1- 2 e^{-\frac{2 N \varepsilon^2}{(b-a)^2}}
Equivalently, setting \delta = 2 e^{-\frac{2 N \varepsilon^2}{(b-a)^2}}, we can rewrite this as
P\left( \mu \in \left[ \frac{S_N}{N}- \sqrt{ \frac{(b-a)^2\ln\left(\frac{2}{\delta}\right)}{2N}}, \frac{S_N}{N} +\sqrt{ \frac{(b-a)^2\ln\left(\frac{2}{\delta}\right)}{2 N}} \right] \right) \ge 1 -\delta
- Comparison with the asymptotic confidence interval: take X_i to be Bernoulli with unknown mean \mu=p, so we bound the variance by p(1-p) \le \sigma_{max}^2=\frac{1}{4} (this is what we did in Hoeffding with a=0, b=1). The half-widths of the two intervals are
(\textrm{CLT}) \quad z_{1-\delta/2} \frac{\sigma_{max}}{\sqrt{N}} \quad \textrm{ vs } \quad \sqrt{2\ln\left(\frac{2}{\delta}\right)} \frac{\sigma_{max}}{\sqrt{N}} \quad (\textrm{Hoeffding})
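A minimal sketch of this comparison in Python (the values of N and \delta are arbitrary illustrative choices, not from the notes):

```python
# Compare the CLT and Hoeffding confidence-interval half-widths for a Bernoulli mean.
import math
from statistics import NormalDist

N, delta = 10_000, 0.05
sigma_max = 0.5                                   # worst case: sqrt(p(1-p)) <= 1/2
z = NormalDist().inv_cdf(1 - delta / 2)           # CLT quantile, ~1.96 for delta = 0.05
clt_halfwidth = z * sigma_max / math.sqrt(N)
hoeffding_halfwidth = math.sqrt(2 * math.log(2 / delta)) * sigma_max / math.sqrt(N)
print(f"CLT       : {clt_halfwidth:.5f}")
print(f"Hoeffding : {hoeffding_halfwidth:.5f}")   # wider, but valid for every N
```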
McDiarmid theorem
A very useful extension of Hoeffding’s theorem applies to suitable functions h(X_1, X_2, \cdots, X_n) other than just the sum.
We say that h(x_1, x_2, \cdots, x_n) satisfies the bounded difference property if there exist constants c_k such that for all x_1, \cdots, x_n and all x_k'
|h(x_1,\cdots, x_k, \cdots, x_n) - h(x_1,\cdots, x_k', \cdots, x_n)| \le c_k
Theorem 4.3 (McDiarmid Theorem) Suppose X_1, \cdots, X_n are independent RVs and h(x_1, \cdots, x_n) satisfies the bounded difference property. Then we have
\begin{aligned}
P\left( h(X_1,\cdots, X_n) - E[h(X_1, \cdots, X_n)] \ge \varepsilon \right) \le & e^{- \frac{\varepsilon^2}{2\sum_{k=1}^n c_k^2}} \\
P\left( h(X_1,\cdots, X_n) - E[h(X_1, \cdots, X_n)] \le -\varepsilon \right) \le & e^{- \frac{\varepsilon^2}{2\sum_{k=1}^n c_k^2}}
\end{aligned}
Proof. The proof uses basic properties of conditional expectation.
“Martingale trick”: Let us define random variables Y_k by Y_0=E[h(X_1, \cdots, X_n)] and, for 1\le k \le n,
Y_k =E[h(X_1,\cdots, X_n)|X_1, \cdots, X_k]
Note that Y_n=h(X_1,\cdots, X_n) and also E[Y_k| X_1, \cdots, X_{k-1}]= Y_{k-1} (the martingale property). If we define further Z_k=Y_k-Y_{k-1} then we have the telescoping sum h(X_1,\cdots, X_n) - E[h(X_1, \cdots, X_n)] = \sum_{k=1}^n Z_k and E[Z_k|X_1,\cdots,X_{k-1}]=0.
Duplication trick: Let \widehat{X}_k be an independent copy of the random variable X_k. Then by the bounded difference property we have, almost surely,
\left| E[h(X_1, \cdots, X_k, \cdots, X_n)|X_1, \cdots, X_k] - E[h(X_1, \cdots, \widehat{X}_k, \cdots, X_n)|X_1, \cdots, X_k]\right| \le c_k
\tag{4.4} Furthermore we have
\begin{aligned}
E[h(X_1, \cdots, X_k, \cdots, X_n)|X_1, \cdots, X_{k-1}]&=E[h(X_1, \cdots, \widehat{X}_k, \cdots, X_n)|X_1, \cdots, X_{k-1}]\\
&=E[h(X_1, \cdots, \widehat{X}_k, \cdots, X_n)|X_1, \cdots, X_k]
\end{aligned}
\tag{4.5} The first equality holds because X_k and \widehat{X}_{k} are identically distributed and the left-hand side is a function of X_1, \cdots, X_{k-1}. The second equality holds because X_k and \widehat{X}_{k} are independent. Combining Equation 4.4 and Equation 4.5 shows that |Z_k| \le c_k almost surely.
Conditioning and Hoeffding: We have, using conditioning and Hoeffding’s bound from Lemma 4.1
\begin{aligned}
E\left[ e^{t\sum_{k=1}^n Z_k} \right] &= E \left[ E \left[ e^{t\sum_{k=1}^{n-1} Z_k} e^{tZ_n} | X_1, \cdots X_{n-1}\right] \right]
= E \left[ e^{t\sum_{k=1}^{n-1} Z_k} E \left[ e^{tZ_n} | X_1, \cdots X_{n-1}\right] \right] \le e^{\frac{t^2 c_n^2}{2}} E \left[ e^{t \sum_{k=1}^{n-1} Z_k}\right]
\end{aligned}
where we used that E[Z_n|X_1, \cdots, X_{n-1}]=0 and |Z_n| \le c_n, so that Hoeffding’s bound from Lemma 4.1 (applied conditionally, with an interval of length 2c_n) gives the factor e^{\frac{t^2 c_n^2}{2}}. By induction we find
E\left[ e^{t \left(h(X_1, \cdots, X_n) -E[h(X_1, \cdots, X_n)]\right)} \right] \le e^{\frac{t^2 \sum_{k=1}^n c_k^2}{2}}
and the rest follows as usual.
Confidence interval for variance estimator
We show how to build a confidence interval for the unbiased variance estimator
V_N(X_1, \cdots, X_N)=\frac{1}{N-1} \sum_{i=1}^N\left(X_i - \frac{S_N}{N}\right)^2 = \frac{1}{N-1}\sum_{i=1}^N X_i^2 - \frac{S_N^2}{(N-1)N}
If we change X_1 into \widehat{X}_1 then S_N(X_1, \cdots, X_N ) - S_N(\widehat{X}_1, \cdots, X_N ) = X_1 - \widehat{X}_1 and so
V_N(X_1, \cdots, X_N)- V_N(\widehat{X}_1, \cdots, X_N) = \frac{X_1^2 - \widehat{X}_1^2}{N-1} - \frac{X_1 - \widehat{X}_1}{N-1} \left(\frac{S_N(X_1, \cdots, X_N )}{N} + \frac{S_N(\widehat{X}_1, \cdots, X_N )}{N} \right)
So if we assume |X_i|\le c we have the bound
|V_N(X_1, \cdots, X_N)- V_N(\widehat{X}_1, \cdots, X_N)| \le \frac{5c^2}{N-1}
and thus by McDiarmid we get
P( V_N - \sigma^2 \ge \varepsilon ) \, \le \, e^{ - \frac{(N-1)^2}{N}\frac{\varepsilon^2}{50c^4}} \,\le\, e^{ -(N-2)\frac{\varepsilon^2}{50c^4}}
and this decays exponentially in N and can again be used to build a confidence interval for the variance, very similar to the one for the mean.
Bernstein bound
In Hoeffding’s bound we use, in an essential way, a worst-case bound on the variance. If the actual variance is much smaller, one should expect the bound to be poor. Bernstein’s bound, which can be used if we have some a priori knowledge about the variance, improves on it.
Theorem 4.4 (Bernstein Bound) Suppose X is a random variable such that |X-E[X]| \le c and \textrm{var}(X) \le \sigma^2. Then
E[e^{t X}]\le e^{t E[X]+ \frac{\sigma^2 t^2}{2(1 -c|t|/3)}} \,.
Proof. We expand the exponential and use that for k\ge 2,
E\left[(X-\mu)^k\right] \le E\left[(X-\mu)^2|X-\mu|^{k-2}\right]\le E[(X-\mu)^2] c^{k-2} \le \sigma^2 c^{k-2}
and get
\begin{aligned}
E\left[e^{t(X-\mu)}\right]=1+ \sum_{k=2}^\infty \frac{t^k}{k!} E[(X-\mu)^k] &\le 1 + \frac{t^2\sigma^2}{2}\sum_{k=2}^\infty\frac{2}{k!}(|t|c)^{k-2} \\
&\le 1 + \frac{t^2\sigma^2}{2}\sum_{k=2}^\infty \left(\frac{|t|c}{3}\right)^{k-2} \quad \textrm{ since } \frac{k!}{2}\ge 3^{k-2}\\
&\le 1 + \frac{t^2\sigma^2}{2(1-\frac{|t|c}{3})} \le e^{\frac{t^2\sigma^2}{2(1-\frac{|t|c}{3})}} \quad \textrm{ since } 1+ x \le e^x
\end{aligned}
and this concludes the proof.
Bernstein confidence interval
To obtain a confidence interval we need to solve an optimization problem: after some straightforward computation we find
\sup_{t \ge 0} \left\{ \varepsilon t - \frac{a t^2}{2(1-bt)} \right\} = \frac{a}{b^2}h\left( \frac{b\varepsilon}{a}\right) \quad \textrm{ where } h(u) = 1 +u - \sqrt{1 + 2u}
and we note that h^{-1}(z)= z + \sqrt{2z}. We obtain the exact same bound for the left tail (by symmetry), and thus to obtain a confidence interval we solve
\delta = 2 e^{-\frac{a}{b^2}h\left( \frac{b\varepsilon}{a}\right)} \iff \varepsilon = b \ln\left( \frac{2}{\delta}\right) + \sqrt{2a \ln\left( \frac{2}{\delta}\right)}
Theorem 4.5 (Bernstein for sum of IID) If X_1, \cdots, X_N are IID random variables with |X_i| \le c and \textrm{Var}(X_i)\le \sigma^2 then
P\left( \mu \in \left[\frac{S_N}{N} - \frac{c}{3N} \ln\left( \frac{2}{\delta}\right) - \sqrt{\frac{2\sigma^2}{N} \ln\left( \frac{2}{\delta}\right)}, \frac{S_N}{N} + \frac{c}{3N} \ln\left( \frac{2}{\delta}\right) + \sqrt{\frac{2\sigma^2}{N} \ln\left( \frac{2}{\delta}\right)} \right] \right) \ge 1 - \delta
Proof. This is Bernstein’s bound applied to S_N, with a=N \sigma^2, b=c/3 and \varepsilon replaced by N \varepsilon. Comparing the resulting half-width with Hoeffding’s:
(\textrm{Bernstein}) \quad \frac{c}{3N} \ln\left( \frac{2}{\delta}\right) + \sqrt{2 \ln\left( \frac{2}{\delta}\right)} \frac{\sigma}{\sqrt{N}} \quad \textrm{ versus} \quad \sqrt{2\ln\left(\frac{2}{\delta}\right)} \frac{\sigma_{max}}{\sqrt{N}} \quad (\textrm{Hoeffding})
You should note that this bound can be substantially better than Hoeffding’s bound if \sigma \ll \sigma_{max}.
Comparing the bounds for Bernoulli
Hoeffding’s bound amounts to bounding the variance by \frac14. Bernstein’s bound, which incorporates knowledge of (or a bound on) the variance, is much better than Hoeffding’s as soon as the variance is not too big (0.1 in the comparison below).
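A small sketch of this comparison (assuming a variance bound \sigma^2=0.1 and c=1; the values of N and \delta are arbitrary):

```python
# Hoeffding vs Bernstein half-widths for a bounded variable with a known variance bound.
import math

delta, c, sigma2 = 0.05, 1.0, 0.1        # |X_i| <= c, Var(X_i) <= sigma2
log_term = math.log(2 / delta)
for N in [100, 1_000, 10_000]:
    hoeffding = math.sqrt(log_term / (2 * N))                      # (b - a) = 1
    bernstein = c / (3 * N) * log_term + math.sqrt(2 * sigma2 * log_term / N)
    print(f"N={N:6d}  Hoeffding={hoeffding:.4f}  Bernstein={bernstein:.4f}")
```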
Markov Chains Basics
We introduce the basic definitions necessary to describe Markov chains and provide a first series of examples. For further reading we recommend the books Lawler (2006) and Levin et al. (2017).
Markov chain on a discrete state space
- A stochastic process is a sequence of random variables \{X_0,X_1, X_2,\cdots \} where X_n takes values in the space S, called the state space, which we assume for the moment to be discrete (either finite or countable), S=\{i_1, i_2, \cdots\}. Think of the index n of X_n as (discrete) “time” and of X_n as describing the state of the system at time n.
- To describe a general stochastic process we need the joint distribution of X_0, \cdots, X_n for any n (the finite-dimensional distributions)
P \left( X_0= i_0, X_1=i_1, \cdots, X_n = i_n \right) = P \left( X_0=i_0\right) P\left( X_1=i_1 \vert X_0=i_0 \right) \cdots P\left( X_n=i_n \vert X_{n-1}=i_{n-1}, \cdots, X_0=i_0 \right)
which we rewrote in terms of conditional probabilities.
- The Markov property is an assumption on the structure of these conditional probabilities, namely that the future depends only on the present and not on the past, that is
P\left( X_n=i_n \vert X_{n-1}=i_{n-1}, \cdots, X_0=i_0 \right)= P(X_n=i_n|X_{n-1}=i_{n-1})
We think here of n-1 as the present and n as the future, and interpret P(X_n=j|X_{n-1}=i) as the probability to be in state j at time n given that the state at time n-1 was i.
Therefore for a Markov process we can rewrite the joint distribution as
P \left( X_0= i_0, X_1=i_1, \cdots, X_n = i_n \right) \,=\, P \left( X_0=i_0\right) P\left( X_1=i_1 \vert X_0=i_0 \right) \cdots P\left( X_n=i_n \vert X_{n-1}=i_{n-1} \right)
- We say that a Markov process is time-homogeneous if P(X_n=j|X_{n-1}=i), the probability to move from i to j, does not depend on the time n.
A time homogeneous Markov chain is specified by
Initial distribution \mu(i) for i \in S where \mu(i)=P(X_0=i). We have
\mu(i)\ge 0 \quad \textrm{ and } \sum_{i \in S} \mu(i) =1
and \mu is called a probability vector.
Transition matrix P(i,j) for i,j \in S where P(i,j)=P(X_n=j|X_{n-1}=i). We have
P(i,j) \ge 0 \quad \textrm{ and } \sum_{j \in S} P(i,j)=1
and P is called the transition matrix; its entries P(i,j) are the transition probabilities.
All quantities of interest for the Markov chain can be computed using these two objects. For example we have
P\left\{ X_0 =i_0, X_1=i_1, X_2=i_2\right\} \,=\, \mu(i_0) P(i_0,i_1) P(i_1, i_2) \,.
or
P\left\{ X_2=i \right\} \,=\, \sum_{i_0\in S} \sum_{i_1 \in S} \mu(i_0) P(i_0,i_1) P(i_1, i)
and so on.
- Without loss of generality, we can relabel the states so that S=\{ 1, 2, 3, \cdots, N \} (with N\le\infty). It will be convenient to set
\mu = ( \mu(1), \mu(2), \cdots, \mu(N))
that is \mu is a row vector whose entries are the initial distribution.
Also we will write P for the N \times N matrix whose entries are P(i,j)
P \,=\, \left( \begin{array}{cccc} P(1,1) & P(1,2) & \cdots & P(1,N) \\ P(2,1) & P(2,2) & \cdots & P(2,N) \\
\vdots & \vdots & & \vdots \\
P(N,1) & P(N,2) & \cdots & P(N,N) \end{array}
\right)
Kolmogorov equations
Proposition 5.1 (Kolmogorov equations) For a Markov process with initial distribution \mu and transition probabilities P:
1. The n-step transition probabilities are given by
P \left\{ X_{n}=j \vert X_0={i} \right\} \,=\, P^n(i,j)
where P^n is the matrix product \underbrace{P\cdots P}_{\rm n~times}.
2. If \mu(i)=P\{X_0=i\} is the distribution of X_0 then
P\{ X_n=i\} \,=\, \mu P^n (i)
is the distribution of X_n.
3. If f = (f(1), \cdots, f(N))^T is a column vector (think of f as a function f: S \to \mathbb{R}) then we have
P^n f(i) \,=\, E\left[ f(X_n) | X_0=i\right].
Proof. For 1. we use induction and assume the formula is true for n-1. We condition on the state at time n-1, use the formula P( AB\vert C) = P(A \vert BC) P(B\vert C) and the Markov property, to find
\begin{aligned}
P \left( X_{n}=j \vert X_0={i}\right) &= \sum_{k \in S} P \left( X_{n}=j \,, X_{n-1}=k \vert X_0={i} \right) \\
&= \sum_{k \in S} P \left( X_{n}=j \vert X_{n-1}=k \,, X_0={i} \right) P \left( X_{n-1}=k \vert X_0={i} \right) \\
&= \sum_{k \in S} P \left\{ X_{n}=j \vert X_{n-1}=k \right\} P \left\{ X_{n-1}=k \vert X_0={i} \right\} \\
&= \sum_{k \in S} P^{n-1}(i, k) P(k,j) \,=\, P^n(i,j) \,.
\end{aligned}
For 2. note that \mu P is a probability vector since
\sum_{i} \mu P(i) \,=\, \sum_{i} \sum_{j} \mu(j ) P(j,i) \,=\, \sum_{j} \mu(j) \sum_i P(j,i) \,=\, \sum_j \mu(j) \,.
Furthermore, by the formula for conditional probabilities and part 1.,
P\{ X_n=j\} \,=\, \sum_{k \in S} P\{ X_n=j \vert X_0 =k\} P\{ X_0=k\} \,=\, \sum_{k} \mu(k) P^n(k,j) \,=\, \mu P^n (j)\,.
For 3. we have
P^nf (i) \,=\, \sum_{k} P^n(i,k) f(k) \,=\, \sum_{k} f(k) P \left\{ X_{n}=k \vert X_0={i} \right\} \,=\, E[f(X_{n})\,\vert \, X_0=i] \,.
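A minimal numerical sketch of the Kolmogorov equations (the 2-state transition matrix below is an arbitrary illustrative choice, not from the notes):

```python
# P^n gives n-step transition probabilities, mu P^n the law of X_n, P^n f conditional expectations.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
mu = np.array([1.0, 0.0])                 # X_0 = state 1 with probability 1
f = np.array([0.0, 1.0])                  # f = indicator of state 2

Pn = np.linalg.matrix_power(P, 10)
print("P^10:\n", Pn)                      # n-step transition probabilities P^n(i,j)
print("law of X_10      :", mu @ Pn)      # row vector mu P^n
print("E[f(X_10)|X_0=i] :", Pn @ f)       # column vector P^n f
```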
Stationary and limiting distributions
- Basic question: for a Markov chain, understand the distribution of \{X_n\} for large n. For example we may want to know whether the limit
\lim_{n\to \infty} P\{ X_n=i\} \,=\, \lim_{n \to \infty} \mu P^n(i)
exists or not, whether it depends on the choice of initial distribution \mu and how to compute it.
- A probability vector \pi is called a limiting distribution if the limit \lim_{n \to \infty} \mu P^n = \pi exists. It could well be that the limit depends on the choice of the initial distribution \mu.
- A probability vector \pi is called a stationary distribution if \pi P = \pi.
Limiting distributions are always stationary distributions: Suppose \pi is a limiting distribution and \lim_{n\to \infty} \mu P^n = \pi. Then
\pi P \,=\, (\lim_{n \to \infty} \mu P^n ) P \,=\, \lim_{n \to \infty} \mu P^{n+1} \,=\, \lim_{n \to \infty} \mu P^{n}
\,=\, \pi \,.
and thus \pi is stationary.
In the next section we will derive conditions under which stationary distributions are unique and are limiting distributions.
Examples
Example 5.1 (2-state Markov chain) Suppose S=\{1,2\} with transition matrix \displaystyle P \,=\, \left( \begin{array}{cc} 1-p & p \\ q & 1-q \end{array} \right) \,.
The equation for the stationary distribution, \pi P = \pi gives
\pi(1) (1-p) + \pi(2) q \,=\, \pi(1) \quad \quad
\pi(1) p + \pi(2) (1-q) \,=\, \pi(2)
that is p \pi(1) = q \pi(2). Normalizing to a probability vector gives \displaystyle \pi = \left( \frac{q}{p+q},\frac{p}{p+q} \right).
We show \pi is a limiting distribution. Set \mu_n \equiv \mu P^n and consider the difference \mu_n -\pi. Using \mu_n(2)=1-\mu_n(1) we get the equation
\begin{aligned}
\mu_n(1) - \pi(1) &= \mu_{n-1} P(1) - \pi(1) = \mu_{n-1}(1) (1-p) + (1- \mu_{n-1}(1)) q - \frac{q}{p+q} \\
& = \mu_{n-1} (1)( 1-p-q) - \frac{q}{p+q}(1 -p -q) \,=\, (1-p-q) ( \mu_{n-1}(1) - \pi(1) )
\end{aligned}
By induction \mu_n(1) - \pi(1) = (1-p-q)^n (\mu_0(1) - \pi(1)). If 0 < p+q < 2 (so that |1-p-q|<1) then \lim_{n \to \infty} \mu_n(1) = \pi(1) and this implies that \lim_{n\to \infty} \mu_n(2) = \pi(2) as well.
Thus, provided 0 < p+q < 2, \mu_n = \mu P^n converges to the stationary distribution \pi for an arbitrary choice of the initial distribution \mu.
Example 5.2 (Coupon collecting) A company offers toys in breakfast cereal boxes. There are N different toys available and each toy is equally likely to be found in any cereal box.
Let X_n be the number of distinct toys that you collect after buying n boxes; it is natural to set X_0=0. Then X_n is a Markov chain with state space \{0,1,2,\cdots,N\} and it has a simple structure since X_{n} either stays the same or increases by 1.
The transition probabilities are
\begin{aligned}
P(j, j+1) \,&=\, P\{ {\rm ~new~toy~} \vert {\rm ~already~j~toys}\} \,=\, \frac{N-j}{N} \\
P(j, j) \,& =\, P\{ {\rm ~no~new~toy~} \vert {\rm ~already~j~toys}\} \,=\, \frac{j}{N}
\end{aligned}
The Markov chain X_n will eventually reach the state N and stay there forever (N is called an absorbing state). Let us denote by \tau the (random) finite time it takes to reach the state N. To compute its expectation, E[\tau], let us write
\tau = T_1 + \cdots +T_N \,,
where T_i is the time needed to get a new toy after you have gotten your (i-1)^{th} toy. The T_i’s are independent and T_i has a geometric distribution with success probability p_i = (N-i+1)/N. Thus
E[ \tau] \,=\, \sum_{i=1}^N E[ T_i] \,=\, \sum_{i=1}^N \frac{N}{N-i+1} \,=\, N \sum_{i=1}^N \frac{1}{i} \approx N
\ln (N) \,.
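A short simulation sketch (illustrative, not from the notes) comparing the empirical mean of \tau with N\sum_{i=1}^N 1/i \approx N\ln N:

```python
# Coupon collector: simulate the time tau to collect all N toys.
import math, random

N, trials = 50, 2000
total = 0
for _ in range(trials):
    seen, tau = set(), 0
    while len(seen) < N:
        seen.add(random.randrange(N))     # each box contains a uniformly random toy
        tau += 1
    total += tau
exact = N * sum(1 / i for i in range(1, N + 1))
print(f"simulated E[tau] ~ {total / trials:.1f}   exact = {exact:.1f}   N ln N = {N * math.log(N):.1f}")
```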
Example 5.3 (Random walk on a graph) Consider an undirected graph G consisting of a vertex set V and an edge set E (each edge e=\{v,w\} is an (unordered) pair of vertices). We say that the vertex v is a neighbor of the vertex w, and write v \sim w, if \{v,w\} is an edge. The degree of a vertex v, denoted \textrm{deg}(v), is the number of neighbors of v.
Given such a graph G=(V,E) the simple random walk on G is the Markov chain with state space V and transition matrix
P(v,w) \,=\, \left\{ \begin{array}{cl} \frac{1}{{\rm deg} (v)} & {\rm if~} w \sim v \\ 0 & {\rm otherwise}
\end{array} \right. \,.
The invariant distribution for the random walk on a graph is given by \pi(v) \,=\, \frac{ {\rm deg} (v)} { 2 |E|} where |E| is the number of edges. First note that \sum_{v} \pi(v)=1 since each edge contributes to the degree of exactly two vertices. To show invariance note that
\pi P(v) \,=\, \sum_{w} \pi(w)P(w,v) \,=\, \sum_{w ; w \sim v} \frac{{\rm deg} (w)}{2|E|} \frac{1}{ {\rm deg}(w)}
\,=\, \frac{1}{2|E|} \sum_{w ; w \sim v}1 \,=\, \pi(v) \,.
For example, for a graph on five vertices with 8 edges (vertex 2 having degree 4, all other vertices degree 3) we get
P\,=\, \left( \begin{array}{ccccc} 0 & \frac13 & \frac13 & \frac 13 & 0 \\ \frac14 & 0 & \frac14 & \frac 14 & \frac14 \\ \frac13 & \frac13 & 0 & 0& \frac 13 \\ \frac13 & \frac13 & 0 & 0 & \frac13 \\ 0 & \frac13 & \frac13 & \frac13 & 0 \\ \end{array} \right) \quad \quad \quad \pi = \left( \frac{3}{16}, \frac{4}{16}, \frac{3}{16}, \frac{3}{16}, \frac{3}{16}\right)
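A quick numerical check (a sketch) that \pi(v)=\mathrm{deg}(v)/2|E| is indeed stationary for this example:

```python
import numpy as np

P = np.array([[0,   1/3, 1/3, 1/3, 0  ],
              [1/4, 0,   1/4, 1/4, 1/4],
              [1/3, 1/3, 0,   0,   1/3],
              [1/3, 1/3, 0,   0,   1/3],
              [0,   1/3, 1/3, 1/3, 0  ]])
deg = np.array([3, 4, 3, 3, 3])
pi = deg / deg.sum()          # deg(v) / (2|E|), here 2|E| = 16
print("pi   :", pi)
print("pi P :", pi @ P)       # equal to pi, so pi is stationary
```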
Example 5.4 (Random walk on the hypercube) The d-dimensional hypercube graph Q_d has for vertices the binary d-tuples \mathbf{x}=(x_1, \cdots, x_d) with x_k \in \{0,1\} and two vertices are connected by an edge when they differ in exactly one coordinate (flipping a 0 into 1 or vice-versa).
The simple random walk on Q_d moves from a vertex \mathbf{x}=(x_1, \cdots, x_d) by choosing a coordinate j \in \{ 1,2,\cdots, d\} uniformly at random and setting x_j \to (1-x_j).
The degree of each vertex is d, the number of vertices is 2^d and the number of edges is 2^d \frac{d}{2}= d\,2^{d-1}, so \pi(\mathbf{x}) \,=\, \frac{1}{2^d}, the uniform distribution on Q_d.
Example 5.5 (Asymmetric random walks on \{0,1,\cdots,N\}) State space S=\{0,1,\cdots,N\} and the Markov chain goes up by 1 with probability p and down by 1 with probability 1-p.
P(j, j+1) =p,\, P(j, j-1) = 1-p \,, \quad \textrm {for } j=1, \cdots, N-1
We can pick different boundary conditions (BC) at 0 and N:
Absorbing BC: P(0,0)=1, P(N,N)=1
Reflecting BC: P(0,1)=1, P(N,N-1)=1
Partially reflecting BC: P(0,0)=(1-p)\,, P(0,1)=p\,, \quad P(N,N-1)=(1-p)\,, P(N,N)=p
Periodic BC: P(0,1)=p\,, P(0, N)=(1-p)\,, \quad P(N, 0) =p\,, P(N, N-1)= (1-p)
Example 5.6 (Ehrenfest urn model) Suppose d balls are distributed among two urns, urn A and urn B. At each move one ball is selected uniformly at random among the d balls and is transferred from its current urn to the other urn. If X_n is the number of balls in urn A then the state space is S=\{0, 1, \cdots, d\} and the transition probabilities
P(j,j+1) \,=\, \frac{d-j}{d}\,, \quad P(j,j-1) \,=\, \frac{j}{d} \,.
We show that the invariant distribution is binomial with parameters (d,\frac{1}{2}), that is \pi(j) \,=\, {d \choose j} \frac{1}{2^d}.
\begin{aligned}
\pi P (j) &= \sum_{k} \pi(k) P(k,j) = \pi(j-1) P(j-1,j) + \pi(j+1) P(j+1, j) \\
&= \frac{1}{2^d}\left[ {d \choose j-1} \frac{d - (j-1)}{d} + {d \choose j+1} \frac{j+1}{d} \right]
= {d \choose j} \frac{1}{2^d} \,.
\end{aligned}
This Markov chain is closely related to the simple random walk Y_n on the hypercube Q_d. Indeed selecting one of the d balls uniformly at random and moving it to the other urn is equivalent to selecting a random coordinate y_k of \mathbf{y}=(y_1, \cdots, y_d) and changing it to 1-y_k. If we denote by j= |\mathbf{y}|=y_1+ \cdots + y_d the number of 1s in \mathbf{y}, then
P(X_{n+1}=j+1|X_n=j) = P(\textrm{ choose } k \textrm{ such that } y_k=0) = \frac{d-j}{d}
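A quick check (sketch, with d=10 chosen arbitrarily) that the Binomial(d,\frac12) distribution is stationary for the Ehrenfest chain:

```python
import numpy as np
from math import comb

d = 10
P = np.zeros((d + 1, d + 1))
for j in range(d + 1):
    if j < d:
        P[j, j + 1] = (d - j) / d     # one of the d-j balls in urn B is selected
    if j > 0:
        P[j, j - 1] = j / d           # one of the j balls in urn A is selected
pi = np.array([comb(d, j) / 2**d for j in range(d + 1)])
print(np.allclose(pi @ P, pi))        # True: pi P = pi
```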
Ergodic theory of finite state space Markov chains
We present the basic convergence theory for finite state Markov chains and introduce the concepts of irreducibility and period of a Markov chain. This is very classical material, see e.g. Lawler (2006) and Levin et al. (2017).
Existence of stationary distributions
We first show that stationary distributions always exist for finite state Markov chains. This need not be the case if the state space is countably infinite.
Theorem 6.1 Let X_n be a Markov chain on a finite state space S. Then X_n has at least one stationary distribution.
Proof. The Bolzano-Weierstrass theorem asserts that any bounded sequence \{\mathbf{x}_k\} in \mathbb{R}^d has a convergent subsequence.
Choose \mu to be an arbitrary initial distribution and let \mu_n = \mu P^n. Applying Bolzano-Weierstrass directly to the sequence \mu_n will not work (a subsequential limit need not be stationary), so consider instead the time averages
\nu_n \,=\, \frac{\mu + \mu P + \cdots \mu P^{n-1}}{n}
The sequence \nu_n is bounded since the \mu_n and \nu_n are probability vectors. Therefore there exists a convergent subsequence \nu_{n_k} with \lim_{k \to \infty}\nu_{n_k}=\pi, and \pi is a probability vector as well. We show that \pi is a stationary distribution. Note
\nu_n P - \nu_n \,=\, \frac{ \mu P + \cdots \mu P^{n} - \mu - \cdots -\mu P^{n-1}}{n} \,=\,
\frac{\mu P^n - \mu}{n} \quad \textrm{ and thus } \quad \left\vert \nu_n P(j) - \nu_n(j) \right\vert \,\le \, \frac{1}{n}
To conclude we note that
| \pi P(j) - \pi(j)| \,=\, \lim_{k \to \infty} | \nu_{n_k}P (j) - \nu_{n_k}(j) | \,\le\, \lim_{k \to \infty} \frac{1}{n_k} \,=\, 0 \,.
and thus \pi P = \pi, that is \pi is a stationary distribution.
Uniqueness of stationary distributions
- It is easy to build examples of Markov chains with multiple stationary distributions.
P \,=\, \left( \begin{array}{cccc} \frac{1}{2} & \frac{1}{2} & 0 & 0 \\ \frac13 & \frac23 & 0 & 0 \\
0 & 0 & \frac16 & \frac56 \\
0 & 0 & \frac45 & \frac15 \end{array} \right) \quad \quad \pi_1=\left( \frac25 , \frac 35 ,0,0\right) \textrm{ and } \pi_2= \left( 0, 0 ,\frac{24}{49}, \frac{25}{49}\right)
\quad \textrm{ are stationary}
If the Markov chain starts in state 1 or 2 it will only visit states 1 and 2 in the future, and so we can build a stationary distribution (by Theorem 6.1, using an initial \mu which vanishes on 3 and 4) which vanishes on states 3 and 4.
Theorem 6.2 (Uniqueness of stationary distributions) Suppose X_n is an irreducible Markov chain with finite state space S.
If \pi is a stationary distribution then \pi(j)>0 for all j\in S,
If h is a column vector such that Ph=h then h=c(1,1,\cdots,1)^T is a constant vector.
The Markov chain X_n has a unique stationary distribution \pi.
Proof. For 1., if \pi is a stationary distribution then \pi(i)>0 for at least one i. If j is such that i \rightsquigarrow j then there exists a time r such that P^r(i,j) >0 and thus
\pi(j) =\pi P^r(j)\,=\, \sum_{k} \pi(k) P^r(k,j) \,\ge \, \pi(i) P^r(i,j) >0 \,.
Since X_n is irreducible, \pi(j)>0 for any j \in S.
For 2., suppose that Ph=h and let i_0 be such that h(i_0)= \max_{i\in S} h(i)\equiv M. If h is not a constant vector there exists j with i_0 \rightsquigarrow j but h(j) < M, and a time r with P^r(i_0,j)>0. Since P^rh=h,
M= h(i_0) \,=\, P^r h(i_0) \,=\, P^r( i_0, j) \underbrace{h(j)}_{< M} + \sum_{l \not= j} P^r(i_0,l) \underbrace{h(l)}_{\le M} < M\sum_{l} P^r(i_0,l)=M\,,
and this is a contradiction.
For 3., part 2. shows that the geometric multiplicity of the eigenvalue 1 of the matrix P is equal to 1, and thus so is the geometric multiplicity of the eigenvalue 1 for the transpose P^T. Hence P has at most one stationary distribution, and by Theorem 6.1 it has exactly one.
Convergence to stationary distribution
Question: If we have a unique stationary distribution is it correct that \lim_{n \to \infty} \mu P^n = \pi?
Without further assumptions the answer is NO, because of possible periodic behavior. Consider for example the random walk on \{1, \cdots, 2N\} with periodic boundary conditions. The stationary distribution is uniform, \pi= \left(\frac{1}{2N}, \cdots, \frac{1}{2N}\right). But if X_0 is even then X_1 is odd, and the chain alternates periodically between even and odd positions. For example for
P \,=\, \left( \begin{array}{cccc} 0 & 1/4 & 0 & 3/4 \\ 3/4 & 0 & 1/4 & 0 \\
0 & 3/4 & 0 & 1/4 \\
1/4 & 0 & 3/4 & 0 \end{array} \right) \quad \quad \pi = \left( \frac14,\frac14 ,\frac14,\frac14 \right)
\textrm{ is stationary}
and
P^{10}=\begin{pmatrix}
0.4995 & 0 & 0.5004 & 0 \\
0 & 0.4995 & 0 & 0.5004 \\
0.5004 & 0 & 0.4995 & 0 \\
0 & 0.5004 & 0 & 0.4995 \\
\end{pmatrix}
\quad \quad
P^{11}=
\begin{pmatrix}
0 & 0.5002 & 0 & 0.4997 \\
0.4997 & 0 & 0.5002 & 0 \\
0 & 0.4997 & 0 & 0.5002 \\
0.5002 & 0 & 0.4997 & 0 \\
\end{pmatrix}
and P^n(i,j) does not converge; it oscillates asymptotically between 0 and 1/2.
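The oscillation can be reproduced with a few lines of code (a sketch using the matrix above):

```python
import numpy as np

P = np.array([[0,   1/4, 0,   3/4],
              [3/4, 0,   1/4, 0  ],
              [0,   3/4, 0,   1/4],
              [1/4, 0,   3/4, 0  ]])
print(np.round(np.linalg.matrix_power(P, 10), 4))   # nonzero entries close to 1/2 on even positions
print(np.round(np.linalg.matrix_power(P, 11), 4))   # the zero pattern has shifted
```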
- A Markov chain X_n is called irreducible and aperiodic if there exists an integer n such that P^n(i,j) > 0 for all pair i,j in S. (The meaning of the terminology will become clearer later on.)
Theorem 6.3 (Doeblin’s Theorem) Suppose X_n is an irreducible and aperiodic Markov chain on the finite state space S with stationary distribution \pi. There exist a constant C>0 and a number \alpha with 0\le \alpha< 1 such that for any initial distribution \mu we have
|\mu P^n(j) - \pi(j)| \,\le\, C \alpha^n \,,
\tag{6.1} i.e., the distribution of X_n converges, exponentially fast, to \pi.
Proof. Since the Markov chain is irreducible and aperiodic we can find an integer r such that P^r has strictly positive entries. Let \Pi be the stochastic matrix
\Pi \,=\, \left( \begin{array}{cccc} \pi(1)& \pi(2)& \cdots& \pi(N) \\
\pi(1)& \pi(2)& \cdots& \pi(N) \\
\vdots& \vdots & &\vdots
\\
\pi(1) &\pi(2)& \cdots& \pi(N)
\end{array}
\right)
where every row is the stationary distribution \pi. Note that this corresponds to independent sampling from the stationary distribution.
Since all elements of P^r(i,j) are strictly positive we can pick \delta >0 sufficiently small such that for all i,j
P^r(i,j) \,\ge \, \delta\,\Pi(i,j)=\delta \pi(j) \,.
\tag{6.2}
Let us set \theta = 1-\delta and by Equation 6.2 we define a stochastic matrix Q through the equation
P^r = (1-\theta) \Pi + \theta Q \,.
Note the following two facts: \Pi P^r = \Pi (each row of \Pi is \pi and \pi P = \pi), and M \Pi = \Pi for any stochastic matrix M (all rows of \Pi are equal); in particular Q^k \Pi = \Pi.
Using these two properties we show, by induction, that for any integer k\ge 1, P^{kr} \,=\, (1 - \theta^k) \Pi + \theta^k Q^k. This is true for k=1, so let us assume it holds for k. We have
\begin{aligned}
P^{r(k+1)} = P^{rk} P^r &= \left[ (1 - \theta^k) \Pi + \theta^k Q^k \right] P^r
= (1 - \theta^k) \Pi P^r + \theta^k Q^k [(1-\theta) \Pi + \theta Q] \\
&= (1 - \theta^k) \Pi + \theta^k(1-\theta) \Pi + \theta^{k+1} Q^{k+1} = (1 - \theta^{k+1}) \Pi + \theta^{k+1} Q^{k+1}
\end{aligned}
and this concludes the induction step.
From this we conclude that P^{rk} \to \Pi as k \to \infty. We write any integer n as n=k r +l where 0 \le l < r. We have then
P^n = P^{kr} P^l = [(1 - \theta^k) \Pi + \theta^k Q^k] P^l = \Pi + \theta^{k} \left[ Q^k P^l - \Pi \right]
and thus
|P^n(i,j) - \pi(j)| \,=\, \theta^k |Q^k P^l(i,j) - \Pi(i,j)| \le \theta^k \le \frac{1}{\theta} (\theta^{1/r})^n \equiv C\alpha^n \,.
The same bound holds for any initial distribution \mu, by multiplying by \mu(i) and summing over i.
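A small numerical sketch of the exponential convergence in Doeblin's theorem (the 3-state matrix below is an arbitrary irreducible aperiodic example, not from the notes):

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
# stationary distribution: normalized left eigenvector of P for eigenvalue 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()
Pn = np.eye(3)
for n in range(1, 21):
    Pn = Pn @ P
    if n % 5 == 0:
        print(n, np.max(np.abs(Pn - pi)))   # max_{i,j} |P^n(i,j) - pi(j)|, decays geometrically
```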
The period of a Markov chain
As we have seen before, Markov chains can exhibit periodic behavior and cycle around various parts of the state space.
- The period of state j, \textrm{per}(j), is the greatest common divisor of the set
{\mathcal T}(j) \,=\, \left\{ n\ge 1\,,\, P^n(j,j)>0 \right\}
that is, {\mathcal T}(j) is the set of all times at which the chain can return to j with positive probability.
For example, for random walks with periodic boundary conditions the period is 2 if the number of sites is even and 1 if the number of sites is odd.
The set {\mathcal T}(j) is closed under addition:
n,m \in {\mathcal T}(j) \implies m+n \in {\mathcal T}(j)
which holds true since P^{n+m}(j,j) \ge P^{n}(j,j)P^{m}(j,j).
By definition if \textrm{per}(j)=d then
{\mathcal T}(j) \subset \{ 0,d,2d, \cdots\}
Since it is closed under addition a fact from number theory (see exercise) implies that there exists k such that
\{ kd, (k+1)d, (k+2)d \cdots \} \subset {\mathcal T}(j) \quad \textrm { or equivalently } \quad P^{nd}(j,j)>0 \textrm{ for all } n\ge k
Lemma 6.1 If i \leftrightsquigarrow j then \textrm{per}(i)=\textrm{per}(j).
Proof. Suppose \textrm{per}(j)=d and i \leftrightsquigarrow j. There exist integers m and n such that P^m(i,j)>0 and P^n(j,i)>0 and thus
P^{m+n}(i,i)>0 \textrm{ and }P^{m+n}(j,j)>0
which means that m+n \in {\mathcal T}(i) \cap {\mathcal T}(j) and so m+n=kd for some integer k.
Suppose l \in {\mathcal T}(i) then
P^{n+m+l}(j,j) \ge P^n(j,i) P^l(i,i) P^m(i,j) > 0
and n+m+l \in {\mathcal T}(j). Therefore d divides l since it divides both n+m+l and n+m and this shows that \textrm{per}(j) \le \textrm{per}(i). Reversing the roles of i and j we find \textrm{per}(j) = \textrm{per}(i). \quad \blacksquare
Consequences:
If X_n is irreducible then all states have the same period and we can define the period of the irreducible Markov chain X_n.
If X_n is irreducible and has period 1 (also called aperiodic) and S is finite then P^{n}(i,j) >0 for all sufficiently large n (compare with our earlier definition of irreducible and aperiodic).
Decomposing the state space S = G_1 \cup G_2 \cup \cdots G_d:
Start with a state i\in S and let i \in G_1.
For all j such that P(i,j)>0 let j \in G_2.
For all j\in G_2 and all k such that P(j,k)>0 let k\in G_3.
After assigning states to G_d, assign the next states to G_1.
Repeat until all states have been assigned. If |S| is finite this takes at most |S| steps.
By construction, since the period is d, we get a partition of S into d distinct subsets G_1, \cdots, G_d and
P(i,j)> 0 \implies i \in G_k \textrm{ and } j \in G_{k+1 (\textrm{mod} d)}
For example
P=
\begin{matrix}
1\\2\\3\\4\\5\\6\\7
\end{matrix}
\begin{pmatrix}
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & .2 & 0 & 0 & .8 & 0 \\
.3 & .7 & 0 & 0 & 0 & 0 & 0 \\
.2 & .8 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & .7 & .3 & 0 \\
\end{pmatrix}
Iteratively we find
1 \rightsquigarrow 4 \rightsquigarrow 3,6 \rightsquigarrow 1,2 \rightsquigarrow 4,7 \rightsquigarrow 3,5,6 \rightsquigarrow 1,2
\rightsquigarrow 4,7
and thus the Markov chain is irreducible with period 3 and so we have
S=\{1,2\} \cup \{4,7\} \cup \{3,5,6\}
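A quick computational check (sketch) of the period of this chain, using the gcd of the possible return times to state 1:

```python
import numpy as np
from math import gcd
from functools import reduce

P = np.array([
    [0,  0,  0,  1,  0,  0,  0 ],
    [0,  0,  0,  0,  0,  0,  1 ],
    [1,  0,  0,  0,  0,  0,  0 ],
    [0,  0, .2,  0,  0, .8,  0 ],
    [.3, .7, 0,  0,  0,  0,  0 ],
    [.2, .8, 0,  0,  0,  0,  0 ],
    [0,  0,  0,  0, .7, .3,  0 ]])
return_times = [n for n in range(1, 25) if np.linalg.matrix_power(P, n)[0, 0] > 0]
print(return_times, "period =", reduce(gcd, return_times))   # multiples of 3, period 3
```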
Relabeling the states, the transition matrix P takes the following block matrix form
P\,=\,
\begin{matrix}
G_1 \\
G_2 \\
\vdots\\
G_{d-1}\\
G_d
\end{matrix}
\begin{pmatrix}
0 & P_{G_1 G_2} & 0 & \cdots &0 \\
0 & 0 &P_{G_2 G_3} & \cdots &0 \\
\vdots & \vdots & & \vdots & \vdots \\
0 & 0 & \cdots & 0 &P_{G_{d-1} G_d} \\
P_{G_dG_1}& 0 & \cdots & 0 & 0
\end{pmatrix}
where P_{G_l G_{l+1}} describes the transitions from states in G_l to states in G_{l+1}.
Raising the matrix P to the power d we find the block diagonal form
P^d\,=\,
\begin{matrix}
G_1 \\
G_2 \\
\vdots\\
G_d
\end{matrix}
\begin{pmatrix}
Q_{G_1} & 0 & 0 & \cdots &0 \\
0 & Q_{G_2} & 0 & \cdots & 0 \\
\vdots & \vdots & & & \vdots \\
0 & 0 & \cdots & 0 & Q_{G_d}
\end{pmatrix}
Theorem 6.4 If X_n is an irreducible Markov chain on a finite state space with stationary distribution \pi then we have
\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} P^k(i,j) \,=\, \pi(j) \,.
Proof. The matrix Q_{G_l}, restricted to the state space G_l, governs the evolution of an irreducible aperiodic Markov chain and thus, by Theorem 6.3, we have
\lim_{n \to \infty} P^{nd}(i,j) \,=\,\lim_{n \to \infty} Q^n_{G_l}(i,j) = \pi_{G_l}(j)\,\quad \textrm{ for } i \in G_l, j \in G_l
where \pi_{G_l} is the stationary distribution for Q_{G_l}. Moreover P^{m}(i,j)=0 if m is not a multiple of d.
If i \in G_{l-1} and j \in G_{l} then
\lim_{n \to \infty} P^{nd+1}(i,j) \,=\, \lim_{n\to \infty} \sum_{k \in G_{l}} P(i,k) P^{nd}(k,j) \,=\, \sum_{k \in G_{l}} P(i,k) \pi_{G_l}(j) \,=\,
\pi_{G_l}(j)\,,
and P^{m}(i,j)=0 if m \not\equiv 1 \pmod d. Similarly if i \in G_{l-r} and j \in G_{l} we have
\lim_{n \to \infty} P^{nd+r}(i,j) \,=\, \pi_{G_l}(j)\,.
and P^{m}(i,j)=0 if m \not\equiv r \pmod d.
The sequence P^n(i,j) is thus asymptotically periodic: d-1 successive zeros alternate with a number converging to \pi_{G_l}(j). Therefore, if we define \pi \equiv \frac{1}{d} ( \pi_{G_1}, \cdots, \pi_{G_d}), then \pi is normalized, stationary and
\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^n P^k(i,j) \,=\, \pi(j)
since the time spent in state j is asymptotically equal to \frac{1}{d} \pi_{G_l}(j).
Hitting and return times
We now have a good understanding of the asymptotic behavior of P^n(i,j) or \mu P^n as n \to \infty. When simulating or observing a Markov chain we usually observe a single sequence
X_0, X_1, X_2, \cdots
and our next goal is to derive a law of large numbers for this sequence of random variables. To do this it will be useful to track how often the Markov chain visits a given state.
Notations:
- Hitting time for the state j, \sigma(j), is the first time the Markov chain visits the state j.
\sigma(j) = \inf\{ n \ge 0\,;\, X_n=j \} \quad \textrm{hitting time }
- Return time for the state j, \tau(j), is the first time the Markov chain returns to the state j.
\tau(j) = \inf\{ n \ge 1\,;\, X_n=j \} \quad \textrm{return time }
Hitting times and return times are closely related and differ only in how they treat what happens at time 0 (both will be useful later on). They are examples of stopping times and have the crucial property that the event \{\tau(j)=k\} depends only on X_0, \cdots, X_k. In particular we have the strong Markov property
P( X_{\tau(j)+1}=l| \tau(j)=k, X_0=i_0, \cdots, X_k=i_k) = P( X_1=l| X_0=j)
which means that the Markov chain starts afresh after the random return time.
Irreducibility means that starting from i the Markov chain can reach j, and thus for any i we have
P\{\tau(j) <\infty |X_0=i\} >0
If the state space is finite then we have something much stronger: not only do we have P\{\tau(j) <\infty |X_0=i\} =1, but the expectation of \tau(j) is actually finite.
Lemma 6.2 For an irreducible Markov chain on the finite state space S the expected hitting times satisfy E[\sigma(j)|X_0=i] < \infty (and similarly for the return times \tau(j)).
Proof. By irreducibility the Markov chain can reach the state j from any state i in a finite time. Since S is finite this time is uniformly bounded: there exist a time m and \varepsilon>0 such that P^l(i,j)\ge \varepsilon for some l \le m, uniformly in i \in S. Using the strong Markov property this implies that
P\{\sigma(j) > k m|X_0=i\} \le (1-\varepsilon) P \{ \sigma(j) > (k-1) m |X_0=i\}\le \cdots \le (1-\varepsilon)^k
Therefore, using that P\{\sigma(j) > n|X_0=i\} is a decreasing function of n
E[\sigma(j)|X_0=i] = \sum_{n \ge 0} P\{\sigma(j) > n|X_0=i\} \le m \sum_{k \ge 0} P\{\sigma(j) > km|X_0=i\}\le m \sum_{k \ge 0}
(1-\varepsilon)^k < \infty \quad \blacksquare
- Note that if S is infinite this is not necessarily true and this will lead to the concepts of transience, recurrence and positive recurrence.
Strong law of large numbers for Markov chains
We will also need the random variable Y_n(j) which counts the number of visits to state j up to time n, that is
Y_n(j) \equiv \sum_{k=0}^{n-1} \mathbf{1}_{\{X_k=j\}} \quad = \textrm{ the number of visits to the state } j \textrm{ up to time } n
Since E \left[ \mathbf{1}_{\{X_k=j\}} \vert X_0=i \right] =P^k(i,j) we have
\frac{1}{n}E[Y_n(j)\vert X_0=i]= \frac{1}{n}\sum_{k=0}^{n-1} P^k(i,j)
and thus by Theorem 6.3 or Theorem 6.4
\pi(j) \,=\, \lim_{n\to \infty} \frac{1}{n}E\left[ \sum_{k=0}^{n-1} {\bf 1}_{\{X_k=j\}} \,\Big\vert\, X_0=i \right] \quad = \textrm{ the expected proportion of time spent in state } j \textrm{ in the long run }
This is an important intuition
\pi(j) = \textrm{expected fraction of time that the Markov chain spends in state } j \textrm{ in the long run}.
The next result strengthens that interpretation.
Theorem 6.5 (Ergodic Theorem for Markov chains) Let X_n be an irreducible Markov chain on a finite state space with arbitrary initial distribution \mu. Then, with probability 1, we have
\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} {\bf 1}_{\{X_k=j\}} \,=\, \pi(j) \,,
this is the ergodic theorem for Markov chains.
Moreover if \tau(j) is the first return time to j we have Kac’s formula
\pi(j) \,=\, \frac{1}{E[\tau(j)|X_0=j]} \,.
Proof. For a Markov chain starting in some state i consider the successive returns to the state j. By the strong Markov property, once the Markov chain has reached j it starts afresh. So the time when the Markov chain returns to state j for the k^{th} time is
T_k(j)\,=\, \tau_1(j) + \cdots+ \tau_k(j) \,,
where the \tau_l(j) are independent copies of the return time \tau(j). For l\ge 2, \tau_l(j) has the distribution of the return time starting from j, while \tau_1(j) depends on the initial condition. Note that by the strong LLN for IID random variables we have
\lim_{k\to \infty}\frac{T_k(j)}{k}\,=\, \lim_{k \to \infty} \frac{1}{k} \left( \tau_1(j) + \cdots+ \tau_k(j) \right) \,=\, E[\tau(j)|X_0=j] \,.
Note further that if the total number of visits to state j up to time n is k, it means that we have returned exactly k times and so
\left\{ Y_n(j)=k \right\} =\left\{ T_{k}(j) \le n < T_{k+1}(j) \right\}
So we have
\frac{ T_{Y_n(j)}(j) } {Y_n(j)} < \frac{n}{Y_n(j)} \le \frac{T_{Y_n(j)+1}(j)}{Y_n(j)+1} \frac{Y_n(j)+1}{ Y_n(j)}
As n \to \infty, Y_n(j) \to \infty too, and both extremes of the inequality converge to E[\tau(j) \vert X_0=j] with probability 1. Thus we conclude that, with probability 1,
\lim_{n \to \infty} \frac{Y_n(j)}{n} \,=\, \frac{1}{E[\tau(j)|X_0=j]} \,.
On the other hand we know that
\lim_{n \to \infty} \frac{E[Y_n(j)]}{n} \,=\, \pi(j) \,,
and thus
\pi(j) \,=\, \frac{1}{E[\tau(j)|X_0=j]} \,.
Putting all the pieces together gives the theorem. \quad \blacksquare.
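A short simulation sketch of the ergodic theorem (the 3-state chain is an arbitrary irreducible example, not from the notes): the empirical occupation fractions approach \pi.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
n, x = 100_000, 0
counts = np.zeros(3)
for _ in range(n):                      # simulate one long trajectory
    counts[x] += 1
    x = rng.choice(3, p=P[x])
print("occupation fractions:", counts / n)
w, v = np.linalg.eig(P.T)               # stationary distribution for comparison
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
print("stationary pi       :", pi / pi.sum())
```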
Transient behavior of Markov chains
Decomposition of state space
We drop the assumption of irreducibility and develop a number of tools to study the transient behavior of Markov chains.
- The communication relation i \leftrightsquigarrow j is an equivalence relation. We use the convention P^0 = I (the identity matrix), so that i \leftrightsquigarrow i (every state communicates with itself since P^0(i,i)=1).
It is reflexive : i \leftrightsquigarrow i.
It is symmetric: i \leftrightsquigarrow j implies j \leftrightsquigarrow i.
It is transitive: i \leftrightsquigarrow j and j \leftrightsquigarrow l implies i \leftrightsquigarrow l.
- Using this equivalence relation we can decompose the state space S into mutually disjoint communication classes
S = C_1 \cup C_2 \cup \cdots \cup C_M \,.
There are two kinds of communication classes, transient classes and closed classes:
A class C is called transient if there exists i\in C and j \in S \setminus C with i \rightsquigarrow j.
This means it is possible to exit the class C and never come back: if the chain could come back then we would have j \rightsquigarrow k for some k \in C, hence j \leftrightsquigarrow i and thus j \in C, a contradiction.
A class C is called closed if it is not transient, that is, for any pair i\in C and j \in S \setminus C we have i \not\rightsquigarrow j.
Clearly it is impossible to exit a closed class.
The next result states that if you start in a finite transient class you will eventually exit it with probability 1 (this may no longer be true if S is infinite).
Lemma 7.1 Suppose S is finite and X_0 \in C where C is a transient class. Then X_n exits C after a finite time with probability 1. As a consequence for i, j \in C we have
\lim_{n\to \infty} P^n(i,j) =0 \,.
Proof. Since C is transient, X_n can exit C starting from any i\in C after a finite time. Since C is finite, there exist
k and \theta < 1 such that
P \{X _k \in C | X_0=i \} \le \theta \,\, \textrm{ for all } i \in C \,.
Using the (strong) Markov property this implies that P \{X _{nk} \in C | X_0=i \} \le \theta^n and so the probability to stay in the transient class goes to 0 as time goes by. If i and j both belong to the transient class C this can be re-expressed as
\lim_{n\to \infty} P^n(i, j) \,=\, 0 \,.
\quad \blacksquare.
Communication diagram
To figure out the class structure it is convenient to build a directed graph where the vertices are the states and each directed edge corresponds to a pair (i,j) with P(i,j)>0.
For example
P=
\begin{matrix}
1 \\ 2 \\ 3\\4\\5\\6\\7\\8
\end{matrix}
\begin{pmatrix}
0 & .4 & .6 & 0 & 0 & 0 & 0 &0 \\
.3 & .7 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & .5 & 0 & 0 & 0 & .5\\
0 & 0 & .2 & 0 & .8 & 0 & 0 &0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 &0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 &0 \\
0 & 0 & 0 & 0 & 0 & .3 & 0 &.7 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 &0
\end{pmatrix}
There are 4 classes: T_1=\{1,2\}, T_2=\{3,4\}, T_3=\{5\} are all transient and C=\{6,7,8\} which is closed.
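The classes can also be found mechanically from the reachability relation; a sketch for the example above:

```python
import numpy as np

P = np.array([
    [0, .4, .6,  0,  0,  0,  0,  0],
    [.3, .7, 0,  0,  0,  0,  0,  0],
    [0,  0,  0, .5,  0,  0,  0, .5],
    [0,  0, .2,  0, .8,  0,  0,  0],
    [0,  0,  0,  0,  0,  1,  0,  0],
    [0,  0,  0,  0,  0,  0,  1,  0],
    [0,  0,  0,  0,  0, .3,  0, .7],
    [0,  0,  0,  0,  0,  1,  0,  0]])
A = (P > 0).astype(int)
R = np.eye(8, dtype=int)                              # R[i,j] = 1 iff i ~> j (n = 0 allowed)
for _ in range(8):
    R = ((R + R @ A) > 0).astype(int)
classes = {tuple(sorted(j + 1 for j in range(8) if R[i, j] and R[j, i])) for i in range(8)}
print(sorted(classes))    # [(1, 2), (3, 4), (5,), (6, 7, 8)]
```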
Decomposition of reducible Markov chains
Consider a Markov chain with closed (recurrent) classes R_1, \cdots, R_L and transient classes denoted by T_1, \cdots, T_K (set T= T_1 \cup \cdots \cup T_K). After reordering the states the transition matrix has the block form
P\,=\,
\begin{matrix}
R_1 \\ R_2 \\ R_3 \\ \vdots \\ R_L \\ T
\end{matrix}
\begin{pmatrix}
P_1 & & & & &\\
& P_2 & & 0 & &\\
& & P_3 & & & \\
& 0 & & \ddots & &\\
& & & &P_L &\\
& & S & & & Q
\end{pmatrix}
\tag{7.1}
where P_l gives the transition probabilities within the class R_l, Q the transitions within the transient classes, and S\not= 0 the transitions from the transient classes into the recurrent classes.
It is easy to see that P^n has the form
P^n\,=\,
\begin{matrix}
R_1 \\ R_2 \\ R_3 \\ \vdots \\ R_L \\ T
\end{matrix}
\begin{pmatrix}
P_1^n & & & & &\\
& P_2^n & & 0 & &\\
& & P_3^n & & & \\
& 0 & & \ddots & &\\
& & & &P_L^n &\\
& & S_n & & & Q^n
\end{pmatrix}
for some matrix S_n.
Example: Consider the random walk with absorbing boundary conditions (see Example 5.5). There are three classes: two closed ones,
\{0\} and \{N\}, and one transient class \{1, \cdots, N-1\}. For example with N=5 we have
P\,=\,
\begin{matrix}
0 \\ 5 \\ 1 \\ 2 \\ 3 \\ 4
\end{matrix}
\begin{pmatrix}
1 & 0 & 0 & 0 &0 & 0\\
0 & 1 & 0 & 0 & 0& 0\\
1/2 & 0 & 0 & 1/2 & 0 & 0 \\
0 & 0 & 1/2 & 0 & 1/2 & 0 \\
0 & 0 & 0 & 1/2 & 0 & 1/2\\
0 & 1/2 & 0 & 0 & 1/2 & 0
\end{pmatrix}
To get a feel for what is going on we compute (rounded to 4-digit precision)
P^{20} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0.7942 & 0.1953 & 0.004 & 0 & 0.0065 & 0 \\
0.5924 & 0.3907 & 0 & 0.0104 & 0 & 0.0065 \\
0.3907 & 0.5924 & 0.0065 & 0 & 0.0104 & 0 \\
0.1953 & 0.7942 & 0 & 0.0065 & 0 & 0.004 \\
\end{pmatrix}\,, \quad
P^{50}=\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0.8 & 0.2 & 0 & 0 & 0 & 0 \\
0.6 & 0.4 & 0 & 0 & 0 & 0 \\
0.4 & 0.6 & 0 & 0 & 0 & 0 \\
0.2 & 0.8 & 0 & 0 & 0 & 0 \\
\end{pmatrix}
As expected, starting in the transient class the Markov chain eventually exits it and reaches one of the two closed classes. From P^{50} one can read that, starting from say state 2, with probability .6 the Markov chain will end up at 0 and with probability .4 it will end up at 5. We will learn how to compute these probabilities later on, but we first study how long it takes to exit the transient class.
Absorption time
Starting in a recurrent class R_l the Markov chain stays in R_l forever and its behavior is dictated entirely by the irreducible matrix P_l with state space R_l.
Starting from a transient class the Markov chain will eventually exit the class T and is absorbed into some recurrent class. One basic question is: What is the expected time until absorption?
We define the absorption time
\tau_{\textrm{abs}} = \min\{ n \ge 0 \,;\, X_n \notin T \}
Theorem 7.1 (Expected time until absorption) Let j be a transient state and let \tau_{\textrm{abs}} be the time until the Markov chain reaches some closed class. Then we have
E[\tau_{abs} |X_0=j] \,=\, \sum_{i\in T} M(j,i) \,.
where
M=(I-Q)^{-1} = I + Q + Q^2 + \cdots
and Q is from the decomposition Equation 7.1.
Proof. From Lemma 7.1 we know that Q^n(i,j) \to 0 and so all eigenvalues of Q have absolute values strictly less than 1. Therefore I-Q is an invertible matrix and we can define
M = (I-Q)^{-1} = I +Q +Q^2 + Q^3 + \cdots
The second equality is the geometric series for matrices which follows from the identity
(I+ Q + \cdots + Q^n)(I-Q) = I -Q^{n+1} \,.
To give a probabilistic interpretation to the matrix M we introduce the random variable
Y(i) \,=\, \sum_{n=0}^\infty 1_{\{X_n=i\}} \quad = \textrm{ total number of visits to state } i
If i is transient by Lemma 7.1 Y(i) < \infty with probability 1. If j is another transient state we have
E[ Y(i) \,\vert\, X_0=j]= \,E\left[ \sum_{n=0}^\infty {\bf 1}_{\{X_n=i\}} \,\vert\, X_0=j\right]
\,=\,\sum_{n=0}^\infty P\left\{ X_n=i \,\vert\, X_0=j\right\}
\,=\,\sum_{n=0}^\infty Q^n(j,i) = M(j,i)
That is M(j,i) is simply the expected number of visits to i if X_0=j and thus, summing over all the transient states we obtain
E[\tau_{abs} |X_0=j] \,=\, \sum_{i\in T} M(j,i) \,. \quad \blacksquare
Hitting time in irreducible Markov chain
Suppose X_n is irreducible and we are interested in computing the expected time to reach one state from another state that is E[ \tau(i)| X_0=j] \textrm{ for } j \not= i.
If i=j then from Theorem 6.5 we have E[ \tau(i)| X_0=i]=\pi(i)^{-1}, where \pi is the stationary distribution.
If i \not= j we use the following idea. Relabel the states such that the first state is i and modify P so that i becomes an absorbing state:
P =
\begin{pmatrix}
P(i,i) & R \\ S & Q
\end{pmatrix}
\quad \quad \longrightarrow \quad \quad
\widetilde{P} =
\begin{pmatrix}
1 & 0 \\ S & Q
\end{pmatrix}
\tag{7.2}
The fact that P is irreducible implies that the states in S\setminus \{i\} are transient for the modified transition matrix \widetilde{P}. The hitting time \tau(i) starting from j\not= i for the Markov chain with transition matrix P is exactly the same as the absorption time for the Markov chain with transition matrix \widetilde{P}: indeed the hitting time does not depend on the submatrix R. Therefore from Theorem 7.1 we obtain immediately
Theorem 7.2 (Expected hitting time in irreducible Markov chain) Let X_n be an irreducible Markov chain. For j \not= i we have
E[\tau(i) |X_0=j] \,=\, \sum_{l\in S \setminus \{i\}} M(j,l) \,.
where M=(I-Q)^{-1} and Q is given in Equation 7.2 and is obtained by deleting the i^{th} row and i^{th} column from P.
Examples
Example (continued): Random walk with absorbing boundary conditions on \{0,1,\cdots,5\}. We get
Q\,=\, \begin{pmatrix}
0 & 1/2 & 0 & 0 \\
1/2 & 0 & 1/2 & 0 \\
0 & 1/2 & 0 & 1/2\\
0 & 0 & 1/2 & 0
\end{pmatrix} \quad \implies
\quad
M=(I-Q)^{-1}
\,=\,
\begin{pmatrix}
1.6 & 1.2 & 0.8 & 0.4 \\
1.2 & 2.4 & 1.6 & 0.8 \\
0.8 & 1.6 & 2.4 & 1.2\\
0.4 & .8 & 1.2 & 1.6
\end{pmatrix}
and thus the expected times until absorption are 4 for states 1 and 4 and 6 for states 2 and 3.
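In code (a sketch reproducing the computation above):

```python
import numpy as np

Q = np.array([[0, .5, 0, 0],
              [.5, 0, .5, 0],
              [0, .5, 0, .5],
              [0, 0, .5, 0]])
M = np.linalg.inv(np.eye(4) - Q)
print(M)               # matches the matrix above
print(M.sum(axis=1))   # [4, 6, 6, 4]: expected absorption times from states 1, 2, 3, 4
```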
Example: Random walk with reflecting boundary conditions (see Example 5.5) with N=5; we compute E[ \tau(0) | X_0=i]. The stationary distribution is \pi = \left(\frac{1}{10}, \frac{2}{10}, \frac{2}{10}, \frac{2}{10}, \frac{2}{10}, \frac{1}{10}\right) so E[ \tau(0) | X_0=0]=10. For i\not=0 we delete the first row and column from P and have
Q\,=\, \left( \begin{array}{ccccc}
0 & 1/2 & 0 & 0 &0 \\
1/2 & 0 & 1/2 & 0 &0 \\
0 & 1/2 & 0 & 1/2 &0 \\
0 & 0 & 1/2 & 0 &1/2 \\
0 & 0 & 0 & 1 &0
\end{array}
\right) \,,
\quad
M=(I-Q)^{-1}
\,=\, \left( \begin{array}{ccccc}
2 & 2 & 2 & 2 & 1\\
2 & 4 & 4 & 4 &2 \\
2 & 4 & 6 & 6& 3\\
2 & 4 & 6 & 8 & 4 \\
2 & 4 & 6 & 8 & 5
\end{array} \right)
and so the expected hitting times of state 0 starting from states 1, 2, 3, 4, 5 are 9, 16, 21, 24, 25 respectively (and the expected return time to 0 is 10).
Absorption probabilities
If X_0 = i \in T belongs to some transient class and if there is more than one closed class, say R_1, R_2, \cdots, R_L, then the Markov chain may be absorbed into distinct closed classes and so we wish to compute the probabilities
P\{ X_n \,\,\textrm{ reaches class } R_l \,|\, X_0=i \}
Without loss of generality we can assume that each closed class consists of a single absorbing state r_1, \cdots, r_L (we can always collapse a closed class into an absorbing state since it does not matter which state in the closed class we first visit). We denote the transient states by t_1, \cdots, t_M and upon reordering the states the transition matrix has the form
P\,=\,
\begin{matrix}
r_1 \\ \vdots \\ r_l \\ t_1 \\ \vdots \\ t_M
\end{matrix}
\begin{pmatrix}
& & & & & \\
& I & & &0& \\
& & & & & \\
& & & & & \\
& S & & &Q & \\
& & & & & \\
\end{pmatrix}
\tag{7.3}
We wish to compute
A( t_i, r_l) = P\{ X_n \, \textrm{ reaches } r_l \,|\, X_0= t_i\}\,.
and it will be convenient to set A(r_l, r_l)=1 and A(r_k, r_l)=0 if k\not= l.
Theorem 7.3 (Absorption probabilities in closed classes) For a Markov chain with transition matrix Equation 7.3 where the states t_1, \cdots t_M are transient we have
P\{ X_n \, \textrm{ reaches } r_l \,|\, X_0= t_i\} = A(t_i, r_l) \quad \textrm{ with } \quad A = (I -Q)^{-1} S
Proof. We condition on the first step of the Markov chain to obtain
\begin{aligned}
A( t_i, r_l) &= P \left\{ X_n = r_l {\rm ~eventually~} \vert X_0= t_i\right\} \\
&= \sum_{k \in S} P \left\{ X_n = r_l \textrm{ eventually}, X_1=k \vert X_0= t_i\right\} \\
&= \sum_{ k \in S} P \left\{ X_1 = k \vert X_0= t_i\right\} P \left\{ X_n = r_l
{\rm ~eventually~} \vert X_1= k\right\} \\
&= \sum_{ k \in S} P(t_i, k) A( k, r_l) \,=\, P(t_i, r_l) + \sum_{ t_k } P(t_i, t_k) A( t_k, r_l) \,.
\end{aligned}
Letting A be the M \times L matrix with entries A(t_i, r_l), this can be written in matrix form as
A \,=\, S + Q A
or
A \,=\, (I-Q)^{-1} S \,=\, MS \,.
Continuing with the random walk with absorbing boundary conditions, we get
A \,=\, MS \,=\,
\left( \begin{array}{cccc}
1.6 & 1.2 & 0.8 & 0.4 \\
1.2 & 2.4 & 1.6 & 0.8 \\
0.8 & 1.6 & 2.4 & 1.2\\
0.4 & .8 & 1.2 & 1.6
\end{array} \right)
\left( \begin{array}{cc} 1/2 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 1/2 \end{array}\right)
\,=\,
\left( \begin{array}{cc} .8 & .2 \\ .6 & .4 \\ .4 & .6 \\ .2 & .8 \end{array}\right)
For example, from state 2 the probability to be absorbed at 0 is .6, and so on.
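The same computation in code (a sketch), combining M=(I-Q)^{-1} with S:

```python
import numpy as np

Q = np.array([[0, .5, 0, 0], [.5, 0, .5, 0], [0, .5, 0, .5], [0, 0, .5, 0]])
S = np.array([[.5, 0], [0, 0], [0, 0], [0, .5]])   # columns: absorption at 0, absorption at 5
A = np.linalg.inv(np.eye(4) - Q) @ S
print(A)   # rows (.8,.2), (.6,.4), (.4,.6), (.2,.8) for starting states 1, 2, 3, 4
```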
Tennis game
- We model a tennis game as a Markov chain: each point is won by player A with probability p and by player B with probability q=1-p, and we assume all points are independent. This leads to the Markov chain with the following communication diagram.
The Markov chain “moves up” until it reaches one of the 5 states in the top row (which we relabel 0,1,2,3,4); the chain then stays on those 5 states forever and performs a random walk with absorbing boundary conditions.
We want the probability that A wins the game, starting of course from the initial score 0-0. We compute this probability by successive conditioning.
Consider the events
D_i=\{ X_n \textrm{ reaches the top row in state } i \}
We compute P(D_i) simply by enumerating all the paths leading up to state i starting from the initial state.
P(D_0)=p^4 + 4 qp^4\,, \quad
P(D_1)=4p^3q^2\,, \quad
P(D_2)= 6p^2q^2\,, \quad
P(D_3)= 4p^2q^3\,, \quad
P(D_4)=q^4 + 4 pq^4 \quad
Conditioning on the D_i’s: \displaystyle P(A) =\sum_{i=0}^4 P(A|D_i) P(D_i)
Obviously we have P(A|D_0) =1 and P(A|D_4) =0. By conditioning on the first step we find the system of equations
\begin{aligned}
P(A|D_1) &= p + q P(A|D_2) & & P(A|D_1) = \frac{p (1-pq)}{1-2pq} \\
P(A |D_2) &= pP(A |D_1) + q P(A |D_3) & \implies \quad\quad & P(A|D_2) = \frac{p^2}{1-2pq}\\
P(A |D_3) &= p P(A|D_2) & & P(A|D_3) = \frac{p^3}{1-2pq}
\end{aligned}
Putting everything together
P(A) = \frac{p^4 (1+2q)(1+4q^2)}{1-2pq}
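The formula is easy to check by brute force. Below is a small Monte-Carlo sketch (ours), assuming standard tennis scoring (first to 4 points, win by 2) and independent points; the function name and parameter values are our own.

```python
import random

def p_win_game(p, n_games=200_000, seed=0):
    """Monte-Carlo estimate of the probability that player A wins a game,
    assuming independent points won by A with probability p and the
    standard rule: first to 4 points with a 2-point margin."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_games):
        a = b = 0
        while True:
            if rng.random() < p:
                a += 1
            else:
                b += 1
            if a >= 4 and a - b >= 2:
                wins += 1
                break
            if b >= 4 and b - a >= 2:
                break
    return wins / n_games

p = 0.55
q = 1 - p
formula = p**4 * (1 + 2*q) * (1 + 4*q**2) / (1 - 2*p*q)
print(formula, p_win_game(p))   # the two values should agree to ~2-3 decimals
```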
Gambler’s ruin
A gambler comes to the casino to play the dice game called craps. In this game the standard bet (pass line bet) has even money odds and the probability to win is \frac{244}{495}=0.49292929..., one of the casino games with the best odds.
Suppose that the player starts with a fortune of j and bets 1 in every game. Their ultimate goal is to reach a fortune of N and, if they do, they stop. Of course they could lose all their money before reaching their goal.
The game is described by a random walk on \{0,1, \cdots, N\} with absorbing boundary conditions and we want to compute
\alpha_j = P(X_n \textrm{ reaches } N \textrm{ before reaching } 0| X_0=j)
Conditioning on the first step gives the second order homogeneous linear difference equation
\alpha_j = p \alpha_{j+1} + (1-p) \alpha_{j-1} \quad \textrm{ with boundary conditions } \alpha_0=0\,, \alpha_N=1
To solve this second order homogeneous linear difference equation we look for solutions of the form \alpha_j=x^j which leads to the equation
px^2 - x + (1-p) =0 \quad \implies x_1=1\,, x_2= \frac{1-p}{p}
\tag{7.4} This leads to the system of equations (for p\not= \frac12)
\alpha_j= C_1 + C_2 \left(\frac{1-p}{p} \right)^j \quad \textrm{ with } \quad \left\{ \begin{array}{r} C_1+C_2=0 \\ C_1 + C_2\left(\frac{1-p}{p} \right)^N=1 \end{array} \right.
Solving gives the Gambler’s ruin probabilities
\alpha_j = \frac{ 1 - \left( \frac{1-p}{p} \right)^j }{1 -\left( \frac{1-p}{p} \right)^N} \quad \quad \textrm{ Gambler's ruin for } q\not= p
If p=\frac{1}{2} the quadratic equation Equation 7.4 has a double root 1 but we note that \alpha_j= j is a second linearly independent solution of \frac{1}{2}\alpha_{j+1} - \alpha_j + \frac{1}{2}\alpha_{j-1}=0. Solving the equation \alpha_j= C_1 + C_2j with the boundary conditions gives
\alpha_j \,=\, \frac{j}{N} \quad \quad \textrm{ Gambler's ruin for } p=\frac{1}{2}
How bad does it get? For example if p=\frac{244}{495} and you start with a fortune of 50 and want to double it, the probability to succeed is \frac{ 1 - \left( \frac{251}{244}\right)^{50}}{1 -\left(\frac{251}{244}\right)^{100}}=.1955, not so great. You could also be more risky and bet an amount of 10, in which case your probability to succeed is a much better \frac{ 1 - \left( \frac{251}{244}\right)^{5}}{1 -\left(\frac{251}{244}\right)^{10}}=.4647. Even better, bet everything at once and your probability to win is p=.4929. (In a casino boldness pays, or at least loses less.)
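These numbers are straightforward to reproduce. Here is a minimal sketch (ours) of a helper implementing the gambler's ruin formula; the function name is our own.

```python
def ruin_success_prob(p, j, N):
    """Probability of reaching N before 0 starting from fortune j,
    for a walk with up-probability p (gambler's ruin formula)."""
    if p == 0.5:
        return j / N
    r = (1 - p) / p
    return (1 - r**j) / (1 - r**N)

p = 244 / 495
print(ruin_success_prob(p, 50, 100))   # ~0.1955: betting 1 at a time
print(ruin_success_prob(p, 5, 10))     # ~0.4647: betting 10 at a time
print(p)                               # ~0.4929: betting everything at once
```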
Limiting cases To get a better handle on the gambler’s ruin formula we slightly rephrase the problem:
We start at 0.
We stop whenever we reach W (W is the desired gain) or when we reach -L (L is the acceptable loss).
The states are now j \in \left\{ -L, -L+1, \cdots, \cdots, W-1, W\right\}
We get
P(-L, W)\,\equiv\, P\left( {\rm Reach~}W{\rm~before~reaching~} - L {\rm ~starting~from~}0 \right) \,=\, \frac{1 - \left( \frac{1-p}{p} \right)^L}{1 - \left( \frac{1-p}{p} \right)^{L+W}} \,.
The limit L \to \infty describes a player with infinite wealth:
P(-\infty,W)= \left\{\begin{array}{cl}1 & {\rm~if~} p> \frac12 \\ \left(\frac{p}{1-p}\right)^W & {\rm~if~} p< \frac12 \end{array} \right.
Even with infinite wealth it is exponentially hard to win W!
The limit W \to \infty describes a player with fortune L who does not stop unless they lose everything.
P(-L,\infty) \,=\, \left\{\begin{array}{cl} 1 - \left(\frac{1-p}{p}\right)^L & {\rm~if~} p > \frac{1}{2}
\\ 0 & {\rm~if~} p < \frac{1}{2} \end{array} \right.
The probability to play forever is, unsurprisingly, 0.
Markov chains with countable state space
Examples
We introduce some basic examples of Markov chains on a countable state space
Example 8.1 (Random walk on the non-negative integers) Let us consider a random walk on the set of nonnegative integers with partially reflecting boundary conditions at 0. The transition probabilities are given by
\begin{matrix}
0 \\ 1 \\ 2 \\ \vdots
\end{matrix}
\begin{pmatrix}
q & p & 0 &0 & 0 &\ldots \\
q & 0 & p &0 & 0 &\ldots \\
0 & q & 0 & p & 0& \ldots \\
\vdots & \vdots & \ddots & \ddots & \ddots& \\
\end{pmatrix}
\quad \quad \textrm{ with } q=1-p
Example 8.2 (Discrete-time queueing model) At a service station (think of a cash register), during each time period there is a probability p that an additional customer joins the queue. The first person in the queue is being served and during each time period there is a probability q that this person exits the queue.
We denote by X_n the number of people in the queue (either being served or waiting in line). The state space is S = \{ 0,1,2,3, \cdots \} and the transition probabilities are given by
P\,=\,
\begin{matrix}
0\\1\\2\\ \vdots
\end{matrix}
\begin{pmatrix}
1-p & p & 0 &0 & \ldots \\
q(1-p) & qp + (1-p)(1-q) & p(1-q) &0 & \ldots \\
0 & q(1-p) & qp + (1-p)(1-q) & p(1-q) & \\
\vdots& \vdots & \ddots & \ddots & \ddots & \\
\end{pmatrix}
Example 8.3 (Repair shop) A repair shop is able to repair one item on any single day. On day n, Z_n items break down and are brought to the repair shop, and we assume that the Z_n are IID random variables with pdf P\{ Z_n = k\} = a_k for k=0,1,2, \cdots. If X_n denotes the number of items in the shop waiting to be repaired we have
X_{n+1} = \max \{ (X_n -1), 0 \} + Z_n
The state space is S = \{ 0,1,2,3, \cdots \} and the transition probabilities are
P\,=\,
\begin{matrix}
0\\1\\2 \\ \vdots
\end{matrix}
\begin{pmatrix}
a_0 & a_1 & a_2 & a_3 & \cdots \\
a_0 & a_1 & a_2 & a_3 & \cdots \\
0 & a_0 & a_1 & a_2 & \cdots \\
\vdots & \vdots & \vdots & \vdots &
\end{pmatrix}
\quad \quad
\textrm{ with } \sum_{k=0}^\infty a_k =1
Example 8.4 (Success run chain) Imagine a player taking a series of bets labelled 0,1,2, \cdots. The probability to win bet j is p_j. If the player wins bet j they move up to bet j+1 but if they lose they move back to bet 0. If X_n denotes the number of successive winning bets then X_n has state space S = \{ 0,1,2,3, \cdots \} and transition probabilities
P\,=\,
\begin{matrix}
0\\1\\2 \\ \vdots
\end{matrix}
\begin{pmatrix}
q_0 & p_0 & 0 &0 & \ldots \\
q_1 & 0 & p_1 &0 & \ldots \\
q_2 & 0 & 0 & p_2 & \\
\vdots & \vdots & & \ddots & \ddots \\
\end{pmatrix}
\quad \quad \textrm{ with } q_i=1-p_i
This Markov chain allows for nice analytical computations.
Example 8.5 (Simple d-dimensional random walk) The state space S of the Markov chain is the d-dimensional lattice \mathbb{Z}^d. We denote by {\bf e}_i, i=1, \cdots, d the standard orthonormal basis in \mathbb{R}^d. We view \mathbb{Z}^d as the vertex set of a graph and any point {\bf x}=(x_1, \cdots, x_d) is connected by edges to its 2d neighbors {\bf x} \pm {\bf e}_i. For the simple random walk we have
p({\bf x}, {\bf x} \pm {\bf e}_i) \,=\, \frac{1}{2d}
and all the others p({\bf x},{\bf y})=0.
Example 8.6 (Branching process) The branching process, also known as the Galton–Watson process, models the evolution of a population over time.
In a unit of time every individual in the population dies and leaves behind a random number of descendants.
To describe the Markov chain we will use IID random variables Z_n^{(k)} indexed by n=0,1,2,\cdots and k=1,2,\cdots.
The branching process is given by
X_{n+1} \,=\, \sum_{k=1}^{X_n} Z_n^{(k)}
which simply says that each of the X_n individuals in the population at time n has a random number Z_n^{(k)} of descendants. It is not convenient to write down the transition probabilities but we will study this process later using its moment generating function.
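To get a feel for the branching process one can simulate the recursion above directly. The sketch below is ours; the Poisson(1.1) offspring distribution is an arbitrary illustrative choice, not something prescribed in these notes.

```python
import numpy as np

def simulate_branching(offspring_sampler, n_steps, x0=1, rng=None):
    """Simulate X_0,...,X_n for a branching process; offspring_sampler(size, rng)
    returns `size` independent offspring counts Z_n^{(k)}."""
    rng = rng or np.random.default_rng(0)
    traj, x = [x0], x0
    for _ in range(n_steps):
        x = int(offspring_sampler(x, rng).sum()) if x > 0 else 0
        traj.append(x)
    return traj

# Illustrative choice (ours): Poisson(1.1) offspring numbers
poisson_offspring = lambda size, rng: rng.poisson(1.1, size)
print(simulate_branching(poisson_offspring, 20))
```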
Transience/recurrence dichotomy
- In Markov chains on a countable state space a new phenomenon occurs compared to the finite state space case. Suppose the Markov chain is irreducible (or starts in a closed class). From a state i the Markov chain will return to i with positive probability, but it is also possible that the Markov chain never returns to i and “wanders away to infinity”.
We introduce the corresponding definitions of transience and recurrence of a state. Recall that the return time to state i is given by
\tau(i) = \min \{ n \ge 1 \,;\, X_n =i \}\,.
A state i is recurrent if the Markov chain starting in i will eventually return to i with probability 1, i.e. if
P \left\{ \tau(i) < \infty |X_0=i \right\} = 1 .
A state i is transient if it is not recurrent, that is starting in i the Markov chain returns to i with probability q<1, i.e., if
P \left\{ \tau(i)< \infty |X_0=i \right\} = q < 1 .
Recall also the random variable Y(i) which counts the number of visits to state i:
Y(i) = \sum_{k=0}^\infty I_{\{X_k=i\}} \quad \textrm{ with expectation } E[Y(i)|X_0=j] = \sum_{n=0}^\infty P^n(j,i)
Theorem 8.1 (Transience/recurrence dichotomy)
A state i is recurrent \displaystyle \iff P\{ Y(i)=\infty | X_0=i \}=1 \iff \sum_{n=0}^\infty P^{n}(i,i) = \infty.
Moreover if i is recurrent and i \leftrightsquigarrow j then j is recurrent and we have
\sum_{n=0}^\infty P^n(i,j) = \infty
and
P \left\{ \tau(j) < \infty |X_0= i \right\} = 1
The state i is transient \displaystyle \iff P\{ Y(i)<\infty | X_0=i \} =1 \iff \sum_{n=0}^\infty P^{n}(i,i) < \infty Moreover if i is transient and i \leftrightsquigarrow j then j is transient and we have
\sum_{n=0}^\infty P^n(i,j) < \infty
and
P \left\{ \tau(j) < \infty |X_0= i \right\} < 1\,.
Proof. If i is recurrent the Markov chain starting from i will return to i with probability 1, and then by the Markov property will return a second time with probability 1 and therefore infinitely many times with probability 1. This means that Y(i)=\infty almost surely and so \sum_{k} P^{k}(i,i) = + \infty.
If i \leftrightsquigarrow j then we can find times l and m such that P^l(i,j)>0 and P^m(j,i)>0 and so
\sum_{n=0}^\infty P^{n}(j,j) \ge \sum_{n=0}^\infty P^{n+l+m}(j,j) \ge P^m(j,i) \sum_{n=0}^\infty P^n(i,i) P^l(i,j) = \infty\,,
and j is recurrent. A similar argument shows that \sum_{n=0}^\infty P^n(i,j) = \infty.
It is a consequence of irreducibility that P \left\{\tau(i) < \tau(j) |X_0= j \right\} > 0. (Argue by contradiction: if this probability were 0 then, by the Markov property, the chain would never visit i starting from j.) As a consequence
0\,=\, P \left\{ \tau(j) = \infty |X_0= j \right\} \,\ge\, P \left\{ \tau(i)< \tau(j) |X_0= j \right\} P \left\{ \tau(j) = \infty |X_0= i \right\}
and therefore P \left\{ \tau(j) < \infty |X_0= i \right\} =1.
On the other hand, if i is transient, by the Markov property again, the random variable Y(i) is a geometric random variable with return probability q<1, which implies that E[Y(i)]=\sum_{k} P^{k}(i,i) = \frac{1}{1-q} <\infty. This implies all the other equivalences stated. \quad \blacksquare
Transience/recurrence for the simple random walk.
We analyze the recurrence and transience properties of the simple random walk on \mathbb{Z}^d (see Example 8.5).
As we will see this depends on the dimension. To prove recurrence or transience here we compute/estimate directly
\sum_{n=0}^\infty P^n(0,0) = \sum_{n=0}^\infty P^{2n}(0,0)
since X_n is periodic with period 2.
- d=1: To return to 0 in 2n steps the Markov chain must take exactly n steps to the left and n steps to the right and thus we have
P^{2n}(0,0) = {2n \choose n} \frac{1}{2^{2n}}
By Stirling’s formula we have n! \sim \sqrt{2 \pi n}e^{-n} n^n where a_n \sim b_n means that \lim a_n/b_n =1. Thus we have
{2n \choose n} \frac{1}{2^{2n}} \sim \frac{1}{2^{2n}} \frac{ \sqrt{2 \pi 2n} e^{-2n} (2n)^{2n}} { 2 \pi n e^{-2n} n^{2n}} = \frac{1}{\sqrt{\pi n}}\,.
Recalling that if a_n \sim b_n then \sum a_n converges if and only if \sum b_n converges, and since \sum \frac{1}{\sqrt{\pi n}} diverges, we see that the random walk in d=1 is recurrent.
d=2: In dimension 2, to return to 0 in 2n steps the Markov chain must take exactly k steps to the left, k steps to the right, n-k steps up and n-k steps down, for some 0 \le k \le n. Therefore
P^{2n}(0,0) = \sum_{k=0}^n \frac{ (2n)!}{ k! k! (n-k)! (n-k)!} \frac{1}{4^{2n}} = \frac{1}{4^{2n}} {2n \choose n} \sum_{k=0}^n {n \choose k} {n \choose n-k}
Now we claim that \sum_{k=0}^n {n \choose k} {n \choose n-k}= {2n \choose n}, as can be seen by counting the number of ways that a team of n can be formed out of n boys and n girls. Therefore
P^{2n}(0,0) = {2n \choose n}^2 \frac{1}{4^{2n}} \sim \frac{1}{\pi n}
and thus the simple random walk is recurrent for d=2 since \sum\frac{1}{n} diverges.
d=3: Similarly as in dimension 2 we find
P^{2n}(0,0) = \sum_{k,j : k+j \le n} \frac{ (2n)!}{ j! j! k! k! (n-k-j)! (n-k-j)!} \frac{1}{6^{2n}}
= \frac{1}{2^{2n}} {2n \choose n} \sum_{k,j : k+j \le n} \left( \frac{1}{3^n} \frac{n!}{j! k! (n-j-k)!} \right)^2
To analyze this quantity we note that, by the multinomial theorem,
\sum_{k,j : k+j \le n} \frac{1}{3^n} \frac{n!}{j! k! (n-j-k)!} =1
Moreover we have
\sum_{i} q_i =1 \implies \sum_{i}q_i^2 \le \max_{i} q_i
and thus we only need to bound the maximum of \displaystyle \frac{n!}{j! k! (n-j-k)!}. If the maximum is attained at (j_0,k_0) then we must have, for example,
\frac{n!}{(j_0-1)! k_0! (n-j_0-k_0+1)!} \le \frac{n!}{(j_0)! k_0! (n-j_0-k_0)!} \implies 2j_0 \le n -k_0 +1 \,.
Repeating the same computation with j_0\to j_0+1, k_0\to k_0-1, k_0 \to k_0+1 gives the set of inequalities
n- j_0 -1 \le 2 k_0 \le n- j_0 +1 \, \quad \textrm{ and } \quad n- k_0 -1 \le 2 j_0 \le n- k_0 +1
which implies that
\frac n3 -1 \le j_0, k_0 \le \frac n3 +1
i.e. j_0 and k_0 are of order n/3. Using Stirling’s formula
P^{2n}(0,0) \le \frac{1}{2^{2n}} {2n \choose n} \frac{1}{3^n} \frac{n!}{ (n/3)! (n/3)! (n/3)! } \sim \frac{3 \sqrt{3}}{2} \frac{1}{ (\pi n)^{3/2}}
which shows that the random walk is transient in dimension 3.
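One can illustrate the dichotomy numerically with the exact formulas for P^{2n}(0,0) derived above. The sketch below (ours) sums the first terms of \sum_n P^{2n}(0,0) in dimensions 1, 2 and 3; the truncation point is arbitrary.

```python
from math import comb, factorial

def p_return(d, n):
    """P^{2n}(0,0) for the simple random walk on Z^d, d=1,2,3,
    using the combinatorial formulas derived above."""
    if d == 1:
        return comb(2*n, n) / 4**n
    if d == 2:
        return comb(2*n, n)**2 / 16**n
    if d == 3:
        s = 0
        for j in range(n + 1):
            for k in range(n - j + 1):
                s += (factorial(n) // (factorial(j)*factorial(k)*factorial(n-j-k)))**2
        return comb(2*n, n) * s / (2**(2*n) * 3**(2*n))
    raise ValueError("d must be 1, 2 or 3")

for d in (1, 2, 3):
    partial = sum(p_return(d, n) for n in range(1, 60))
    # the full series diverges for d=1,2 (recurrence) and converges for d=3 (transience)
    print(d, round(partial, 3))
```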
Transience/recurrence for the success run chain
Continuing with Example 8.4 we consider the return time to state 0, \tau(0), whose p.d.f. we can compute explicitly since to return to 0 the only possible paths are 0 \rightarrow 0, 0 \rightarrow 1 \rightarrow 0, 0 \rightarrow 1 \rightarrow 2 \rightarrow 0, and so on. We set u_n \equiv p_0p_1 \cdots p_{n-1} (with u_0=1) and find
P( \tau(0) = k|X_0=0) = p_0 p_1 \cdots p_{k-2}q_{k-1}= p_0 p_1 \cdots p_{k-2}(1-p_{k-1}) = u_{k-1} - u_k\,.
Therefore
P( \tau(0) \le n |X_0=0) = \sum_{k=1}^n P( \tau(0) = k|X_0=0) = (u_0-u_1) + (u_1-u_2) + \cdots + (u_{n-1} - u_n) = 1 - u_n
and
P( \tau(0) < \infty|X_0=0) = 1 - \lim_{n\to \infty} u_n
and we obtain that
\textrm{The success run chain is recurrent if and only if } \lim_{n \to \infty} u_n = \lim_{n \to \infty} p_0 \cdots p_{n-1} =0
To have a better handle on this criterion we need a little result from analysis about infinite products.
Lemma 8.1 With q_k=1-p_k and u_n= \prod_{k=0}^{n-1} p_k
\lim_{n \to \infty} u_n= \lim_{n\to \infty} \prod_{k=0}^{n-1} p_k = 0 \iff \sum_{k=0}^\infty {q_k} = \infty
\lim_{n \to \infty} u_n= \lim_{n\to \infty} \prod_{k=0}^{n-1} p_k > 0 \iff \sum_{k=0}^\infty {q_k} < \infty
Proof. We have
\displaystyle \prod_{k=0}^{n-1} p_k > 0 \iff \infty > - \log \left(\prod_{k=0}^{n-1} p_k \right) = - \sum_{k=0}^{n-1} \log p_k = - \sum_{k=0}^{n-1} \log (1-q_k) Taking n \to \infty shows that
\lim_{n\to \infty} \prod_{k=0}^{n-1} p_k > 0 \iff \sum_{k=0}^\infty \log (1-q_k) \textrm{ converges.}
For this to happen we must have \lim_{k \to \infty} q_k =0. But since \lim_{x \to 0} \frac{-\log(1-x)}{x} =1 (by L'Hospital's rule), the series \sum_{k} \log (1-q_k) converges if and only if \sum_{k} q_k converges. \quad \blacksquare
For the chain to be transient q_k must go to 0 fast enough. If say q_i= \frac{1}{i} then the chain is recurrent while if q_i= \frac{1}{i^2} then it is transient.
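A quick numerical illustration of the criterion: the sketch below (ours) evaluates u_n for q_k = 1/(k+2) and q_k = 1/(k+2)^2; the shift by 2 is only there to avoid dividing by zero at k=0 and does not change the convergence behavior.

```python
import numpy as np

def u(n, q_fun):
    """u_n = p_0 p_1 ... p_{n-1} with p_k = 1 - q_fun(k)."""
    p = 1.0 - np.array([q_fun(k) for k in range(n)])
    return p.prod()

# q_k ~ 1/k: sum diverges,   u_n -> 0      -> recurrent
# q_k ~ 1/k^2: sum converges, u_n -> c > 0 -> transient
for label, q_fun in [("harmonic", lambda k: 1/(k+2)),
                     ("square",   lambda k: 1/(k+2)**2)]:
    print(label, [round(u(n, q_fun), 4) for n in (10, 100, 1000, 10000)])
```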
Another criterion for transience
We add one more method to establish transience. For this, pick a reference state j and consider the hitting time of state j, \sigma(j) (not the return time; we will play with both).
\alpha(i)\,=\, P \left\{ \sigma(j)< \infty | X_0= i \right\}
By definition we have \alpha(j)=1 since \sigma(j)=0 if X_{0} = j. If the chain is transient we must have
\alpha(i) < 1 \quad \textrm { for } i \not= j \,.
since by Theorem 8.1, for i \not=j we have \alpha(i)= P \left\{ \tau(j) < \infty |X_0= i \right\} < 1.
Let us derive an equation for \alpha(i) by conditioning on the first step. For i \not=j
\begin{aligned}
\alpha(i) &= P \left( \sigma(j) < \infty | X_0= i \right) = P \left( \tau(j) < \infty | X_0= i \right) \\
&= \sum_{k \in S} P \left( \tau(j) < \infty | X_1=k \right) P(i,k) = \sum_{k \in S} P \left( \sigma(j)< \infty | X_0=k \right) P(i,k) \\
&= \sum_{k \in S} P(i,k) \alpha(k)
\end{aligned}
and thus \alpha(i) satisfies the equation
P\alpha(i) = \alpha(i) \,, \quad i \not= j \,.
Theorem 8.2 (A criterion for transience) An irreducible Markov chain X_n is transient if and only if for some state j_0 there exists a solution for the equation
P\alpha(i) = \alpha(i) \textrm { for } i \not= j_0
\tag{8.1} such that
\alpha(j_0)=1 \quad \textrm { and } \quad 0 < \alpha(i) < 1 \quad \textrm { for } i \not= j_0
\tag{8.2}
Proof. We have already established the necessity. In order to show the sufficiency assume that we have found a solution for Equation 8.1 and Equation 8.2. Then for i \not= j_0 we have, using repeatedly the equation P\alpha(i)=\alpha(i)
\begin{aligned}
1 &> \alpha(i) = P\alpha(i) = P(i, j_0) \alpha(j_0) + \sum_{j \not= j_0} P(i, j) \alpha(j) = P(i, j_0) + \sum_{j \not= j_0} P(i, j) P\alpha(j) \\
&= P(i, j_0) + \sum_{j \not= j_0} P(i, j) P(j,j_0) + \sum_{j \not= j_0, k \not= j_0} P(i, j) P(j,k) \alpha(k) \\
&= P(i, j_0) + \sum_{j \not= j_0} P(i, j) P(j,j_0) + \sum_{j,k \not= j_0} P(i, j) P(j,k) P(k,j_0) + \cdots \\
&\ge P(\tau(j_0)=1|X_0=i) + P(\tau(j_0)=2|X_0=i) + P(\tau(j_0)=3|X_0=i) + \cdots \qquad \textrm{(dropping the nonnegative remainder)}\\
&= P(\tau(j_0)< \infty|X_0=i)\,.
\end{aligned}
which establishes transience. \quad \blacksquare
Transience/recurrence for the RW on \{0,1,2,\cdots\}
- Continuing with Example 8.1 we use Theorem 8.2. We pick 0 as the reference state and for j \not=0 solve the equation
P\alpha(j) = P(j, j-1) \alpha(j-1) + P(j, j+1) \alpha(j+1) = (1-p) \alpha(j-1) + p \alpha(j+1) = \alpha(j)
whose solution is (like for the Gambler’s ruin problem)
\alpha(j) = \left\{ \begin{array}{cl} C_1 + C_2 \left( \frac{1-p}{p}\right)^j & \textrm{ if } p \not= \frac{1}{2} \\
C_1 + C_2 j & \textrm{ if } p = \frac{1}{2}
\end{array}
\right. \,.
Using that \alpha(0)=1 we find
\alpha(j) = \left\{ \begin{array}{cl} (1-C_2) + C_2 \left( \frac{1-p}{p}\right)^j & \textrm{ if } p \not= \frac{1}{2} \\
(1-C_2) + C_2 j & \textrm{ if } p = \frac{1}{2}
\end{array}
\right. \,.
and we see that the condition 0 < \alpha(i) < 1 can be satisfied only if (1-p)/p < 1 (that is p > 1/2), by choosing C_2=1. Thus we conclude
\textrm{ The random walk on } \{0,1,2, \cdots\} \textrm{ is } \left\{ \begin{array}{l} \textrm{transient for } p> \frac{1}{2} \\
\textrm{recurrent for } p \le \frac{1}{2}
\end{array}
\right. \,.
Positive recurrent Markov chains
Positive recurrence versus null recurrence
A finite state irreducible Markov chain is always recurrent, P(\tau(i) < \infty |X_0=j)=1 and we have proved Kac’s formula for the invariant measure \pi(i) = E[ \tau(i)|X_0=i]^{-1}, that is the random variable \tau(i) has finite expectation.
For a countable state space it is possible for a Markov chain to be recurrent while \tau(i) does not have finite expectation. This motivates the following definitions.
A state i is positive recurrent if E[ \tau(i)|X_0=i] < \infty
A state i is null recurrent if it is recurrent but not positive recurrent
We first investigate the relation between recurrence and the existence of stationary distributions. We show that if one state j is positive recurrent then there exists a stationary distribution. The basic idea is to decompose any path of the Markov chain into successive visits to the state j. To build up our intuition, if a stationary distribution were to exist it should measure the proportion of time spent in state i, and to capture this we introduce
\mu(i) = E\left[ \sum_{n=0}^{\tau(j)-1} {\bf 1}_{\{X_n=i\}}| X_0=j \right] \quad = \textrm{expected number of visits to $i$ between two successive visits to $j$}.
Note that \mu(j)=1 since X_0=j, and if j is positive recurrent
\sum_{i} \mu(i) = \sum_{i\in S} E\left[ \sum_{n=0}^{\tau(j)-1} {\bf 1}_{\{X_n=i\}}| X_0=j \right] \,=\, E\left[ \tau(j) \,|\, X_0=j \right]
and thus
\pi(i)= \frac{\mu(i)}{E[ \tau(j)|X_0=j]}
is a probability distribution.
Theorem 9.1 For a recurrent irreducible Markov chain X_n and a fixed state j, \mu(i) = E[ \sum_{n=0}^{\tau(j)-1} {\bf 1}_{\{X_n=i\}}| X_0=j ] is stationary in the sense that
\mu P = \mu
If the state j is positive recurrent then \mu can be normalized to a stationary distribution \pi.
Proof. The chain visits j at time 0 and then only again at time \tau(j) and thus we have the two formulas
\begin{aligned}
\mu(i) &= E\left[ \sum_{n=0}^{\tau(j)-1} {\bf 1}_{\{X_n=i\}}| X_0=j \right] = \sum_{n=0}^\infty P( X_{n}=i, \tau(j)> n | X_0=j)\\
&= E\left[ \sum_{n=1}^{\tau(j)} {\bf 1}_{\{X_n=i\}}| X_0=j \right]
= \sum_{n=1}^{\infty} P( X_n=i, \tau(j) \ge n | X_0=j )
\end{aligned}
We have then, by conditioning on the last step, and using \mu(j)=1
\begin{aligned}
\mu(i) &= \sum_{n=1}^{\infty} P( X_n=i, \tau(j)\ge n | X_0=j ) = P(j,i) + \sum_{n=2}^\infty P( X_n =i, \tau(j)\ge n | X_0=j) \\
&= P(j,i) +\sum_{k \in S, k\not=j} \sum_{n=2}^\infty P( X_n =i, X_{n-1}=k, \tau(j)\ge n | X_0=j) \\
&= P(j,i) +\sum_{k \in S, k\not=j} \sum_{n=2}^\infty P(k,i) P( X_{n-1}=k, \tau(j)\ge n | X_0=j) \\
&= P(j,i) +\sum_{k \in S, k\not=j} \sum_{n=2}^\infty P(k,i) P( X_{n-1}=k, \tau(j)> n-1 | X_0=j) \\
&= P(j,i) +\sum_{k \in S, k\not=j} \sum_{m=1}^\infty P(k,i) P( X_{m}=k, \tau(j)> m | X_0=j) \\
&= \mu(j) P(j,i) + \sum_{k\not=j} E\left[ \sum_{m=1}^{\tau(j)-1} {\bf 1}_{\{X_m=k\}}| X_0=j \right] P(k,i) \\
&= \mu(j) P(j,i) + \sum_{k\not=j} \mu(k) P(k,i) \,=\, \sum_{k} \mu(k) P(k,i)
\end{aligned}
which proves the invariance of \mu. If the chain is positive recurrent we have already seen that \mu is normalizable. \quad \blacksquare.
Stationarity and irreducibility imply positive recurrence
Theorem 9.2 Assume the irreducible Markov chain has a stationary distribution \pi(i). Then \pi(i)>0 for any i and we have Kac’s formula \pi(i) = E \left[ \tau(i) | X_0=i \right]^{-1}. In particular all states are positive recurrent and the stationary distribution is unique.
Proof. Let us assume that \pi is invariant. We first show that the chain must be recurrent. If the chain were transient then we would have P^n(i,j) \to 0 as n \to \infty and so by dominated convergence
\pi(i) = \sum_{j} \pi(j) P^n(j,i) \to 0 \textrm{ as } n \to \infty \,.
which is impossible.
The fact that \pi(i)>0 for all i\in S is proved as for finite state space; see the argument in Theorem 6.2.
To prove positive recurrence we use a clever argument involving the time reversed chain (more on time reversal in Chapter 12 and in the exercises). Consider the Markov chain with transition matrix Q(i,j)=\frac{\pi(j) P(j,i)} {\pi(i)}. It is easy to verify that Q(i,j) is a transition matrix and that \pi is stationary for Q, \pi Q=\pi. We denote by Y_n the Markov chain with transition matrix Q, since \pi(i)>0 this Markov chain has the same communication structure as X_n and is irreducible since X_n is. By the previous argument Y_n must be recurrent.
Next we write
\pi(i) E[ \tau(i) |X_0=i] \,=\, \pi(i) \sum_{n=1}^\infty P (\tau(i) \ge n |X_0=i)
The event \{\tau(i) \ge n \} conditioned on \{ X_0=i\} corresponds to a sequence of states
i_0=i, i_1, \cdots \cdots, i_{n-1}, i_n=j
where i_1, \cdots, i_{n-1} cannot be equal to i and j can be any state. Using repeatedly the relation \pi(i)P(i,j)= \pi(j)Q(j,i), the probability of such an event can be written as
\begin{aligned}
\pi(i) P(i,i_1) \cdots P(i_{n-1}, i_n) &= \pi(j) Q(j,i_{n-1}) \cdots Q(i_1, i)
\end{aligned}
For the Markov chain Y_n this corresponds to a path starting in j and hitting i for the first time after exactly n steps. Therefore we find
\begin{aligned}
\pi(i) E[ \tau(i) |X_0=i] = \pi(i) \sum_{n =1}^\infty P (\tau(i) \ge n |X_0=i)
&= \sum_{j \in S} \pi(j)\sum_{n=1}^\infty P ( \tau(i) =n | Y_0 = j) \\
&= \sum_{j \in S} \pi(j) P( \tau(i) < \infty|Y_0 = j) =1 \,.
\end{aligned}
where for the last equality we have used the recurrence of the time reversed chain. This shows Kac’s formula which implies the uniqueness of the stationary distribution and that all states are positive recurrent. \quad \blacksquare
Ergodic theorem for countable Markov chains
Theorem 9.3 (Ergodic theorem for countable state space Markov chains)
- If X_n is irreducible and positive recurrent then there exists a unique stationary distribution \pi and for any initial distribution \mu we have
\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} {\bf 1}_{\{X_k=j\}} \,=\, \pi(j) \,,
with probability 1. In particular \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} \mu P^k(j) \,=\, \pi(j). Moreover \pi satisfies Kac’s formula \pi(j) =\frac{1}{E[ \tau(j) |X_0= j]}. Conversely if an irreducible Markov chain has a stationary distribution then it is positive recurrent.
Proof. We have actually already proved all of it. Positive recurrence implies the existence of the stationary distribution (Theorem 9.1) and Kac’s formula is from Theorem 9.2, which implies uniqueness of the stationary distribution. We can now repeat the proof of Theorem 6.5 to show that if X_0=i with i arbitrary we have
\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} {\bf 1}_{\{X_k=j\}} = \pi(j) \,.
\tag{9.1} The reader should verify that the proof of Theorem 6.5 only uses positive recurrence and not the finiteness of the state space. Taking now the expectation of Equation 9.1 and summing over the initial condition we have \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} \mu P^k(j) \,=\, \pi(j). \quad \blacksquare
Convergence to equilibrium via coupling
We continue our theoretical considerations by proving that if the chain is aperiodic then the distribution of X_n converges to \pi(j).
Theorem 9.4 Suppose X_n is an irreducible positive recurrent aperiodic Markov chain. Then for any initial distribution \mu we have
\lim_{n \to \infty} \mu P^n(j) \,=\, \pi(j) \,.
Proof. We will use a coupling argument: we take two independent copies X_n and Y_n of the Markov chain where X_n is starting in the initial distribution \mu while Y_n is starting in the stationary distribution \pi.
The idea is to consider the coupling time
\sigma = \inf\{ n \ge 1 ; X_n = Y_n\} \,.
At the (random) time \sigma, X_n and Y_n are in the same state and after that time X_n and Y_n must have the same distribution by the (strong) Markov property. But since Y_n is distributed according to \pi so must X_n be as well and thus, at the coupling time \sigma, X_n has reached stationarity.
Let us now consider the chain Z_n=(X_n,Y_n) with transition probabilities
P( Z_{n+1} = (k,l) \,|\, Z_n=(i,j) ) = P(i,k) P(j,l)
and stationary distribution \pi( i,j) = \pi(i) \pi(j). Since X_n and Y_n are aperiodic, given states i,j,k,l we can find n_0 such that for every n \ge n_0 we have P^n(i,k) > 0 and P^n(j,l)>0. This implies that Z_n is irreducible and thus, since a stationary measure exists, by Theorem 9.3 the chain Z_n is positive recurrent. Since the coupling time is the first time the Markov chain Z_n hits a state of the form (j,j), recurrence of Z_n implies that P( \sigma < \infty)=1 and thus P(\sigma >n ) \to 0.
To conclude
\begin{aligned}
\left| \mu P^n(j) - \pi(j)\right| & = \left| P( X_n=j) - P(Y_n =j) \right| \\
& \le \left| P( X_n=j , \sigma \le n ) - P(Y_n =j , \sigma\le n) \right| + \left| P( X_n=j , \sigma > n ) - P(Y_n =j, \sigma > n) \right| \\
& = \left| P( X_n=j , \sigma > n ) - P(Y_n =j, \sigma > n) \right| \\
& = \left| E[ ({\bf 1}_{\{X_n=j\}} - {\bf 1}_{\{Y_n=j\}}) {\bf 1}_{\{\sigma > n\}} ] \right| \\
& \le E[ {\bf 1}_{\{\sigma > n\}}] = P( \sigma > n) \rightarrow 0 \textrm{ as } n \to \infty
\end{aligned}
and this proves the convergence. \quad \blacksquare
- The idea of coupling used in Theorem 9.4 is an instance of a powerful idea. We can use couplings other than the one used here, and if we can bound the tail behavior of the coupling time then we can control the speed of convergence to the stationary distribution! We will exploit this idea later on.
Examples
- Positive recurrence for the random walk on \{0,1,2, \cdots\}. Continuing with Example 8.1, we use Theorem 9.3 to establish positive recurrence by computing the stationary distribution. We find the equations
\pi(0) (1-p) + \pi(1)(1-p) = \pi(0) \quad \textrm{ and } \quad \pi(j+1)(1-p) + \pi(j-1)p = \pi(j) \quad \textrm{ for }j \ge 1
The second equation has the general solution
\pi(n)=C_1 + C_2 \left(\frac{p}{1-p}\right)^n \,\,\,(\textrm{if }p\not=\frac12) \quad \textrm{ and } \quad \pi(n)=C_1 + C_2n\,\,\, (\textrm{ if } p=\frac12)
and the first equation gives \pi(1)=\frac{p}{1-p}\pi(0). For p=\frac{1}{2} we cannot find a normalized solution. For p < \frac{1}{2} we can choose C_1=0 and \pi(n) = \left(\frac{p}{1-p}\right)^n \pi(0) can be normalized to find \pi(n) = \frac{1-2p}{1-p} \left(\frac{p}{1-p}\right)^n (see the simulation sketch after these examples). If p> \frac{1}{2} we already know the Markov chain is transient and thus
\textrm{ The random walk on } \{0,1,2, \cdots\} \textrm{ is } \left\{ \begin{array}{l} \textrm{transient for } p> \frac{1}{2} \\
\textrm{null recurrent for } p = \frac{1}{2} \\
\textrm{positive recurrent for } p < \frac{1}{2} \\
\end{array}
\right. \,.
- Positive recurrence for the success run chain: Continuing with Example 8.4 we determine positive recurrence by solving \pi P=\pi. This gives the equations
\begin{aligned}
\pi(0) &= \pi(0) q_0 + \pi(1)q_1 + \cdots = \sum_{n=0}^\infty \pi(n) q_n \\
\pi(1) &= \pi(0) p_0\,, \quad \pi(2) = \pi(1) p_1, \quad \cdots
\end{aligned}
From the second line we find \pi(n) = \pi(0) p_0 p_1 \cdots p_{n-1} and inserting into the first equation
\pi(0) = \pi(0) \left[ (1-p_0) + p_0 (1-p_1) + p_0 p_1(1-p_2) + \cdots \right] = \pi(0) \left[ 1 - \lim_{n \to \infty} p_0 p_1 \cdots p_{n-1}\right]
Recall that recurrence occurs provided \lim_{n\to \infty} \prod_{j=0}^{n-1} p_j = 0 and so if X_n is recurrent there exists a solution of \pi P=\pi and we can normalize \pi provided
\sum_{n=1}^\infty \prod_{j=0}^{n-1} p_j < \infty
Therefore we obtain
\textrm{ The success run chain is } \left\{ \begin{array}{l} \textrm{transient if } \lim_{n} \prod_{j=0}^{n-1} p_j >0 \\
\textrm{recurrent if } \lim_{n} \prod_{j=0}^{n-1} p_j = 0 \\
\textrm{positive recurrent if } \sum_{n} \prod_{j=0}^{n-1} p_j < \infty \\
\end{array}
\right. \,.
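To illustrate the first example above, here is a small simulation sketch (ours) of the partially reflected random walk of Example 8.1 with p<1/2, comparing empirical occupation frequencies with the stationary distribution \pi(n) = \frac{1-2p}{1-p}\left(\frac{p}{1-p}\right)^n computed above; the value of p and the run length are arbitrary.

```python
import numpy as np

p, n_steps = 0.3, 500_000
rng = np.random.default_rng(1)
x, counts = 0, {}
for _ in range(n_steps):
    # partially reflected walk: up with prob p, otherwise down (reflected at 0)
    x = x + 1 if rng.random() < p else max(x - 1, 0)
    counts[x] = counts.get(x, 0) + 1

for n in range(5):
    empirical = counts.get(n, 0) / n_steps
    exact = (1 - 2*p) / (1 - p) * (p / (1 - p))**n
    print(n, round(empirical, 4), round(exact, 4))
```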
Balance equation
Consider a Markov chain with transition probabilities P(i,j) and stationary distribution \pi(i). We can rewrite the equation for stationarity,
\pi(i)=\sum_{j} \pi(j) P(j,i)
as
\sum_{j} \pi(i) P(i,j) \,=\, \sum_{j} \pi(j) P(j,i) \,.
\tag{11.1} which we are going to interpret as a balance equation.
We introduce the (stationary) probability current from i to j as
J(i,j) \equiv \pi(i) P(i,j)
\tag{11.2} and Equation 11.1 can be rewritten as
\underbrace{\sum_{ j} J(i,j)}_{\textrm{total current out of state } i} \,=\, \underbrace{\sum_j J(j,i) }_{\textrm{total current into state } i} \, \quad \quad \textrm{ balance equation }
\tag{11.3} i.e., to be stationary the total probability current out of i must be equal to the total probability current into i.
Detailed balance
A stronger condition for stationarity can be expressed in terms of the balance between the currents J(i,j). A Markov chain X_n satisfies detailed balance if there exists \pi(i)\ge 0 with \sum_{i} \pi(i)= 1 such that for all i,j we have
\pi(i) P(i,j) \,=\, \pi(j) P(j,i) \quad \textrm{ or equivalently } \quad J(i,j) = J(j,i)
\tag{11.4} This means that for every pair i,j the probability currents J(i,j) and J(j,i) balance each other.
Clearly Equation 11.4 is a stronger condition than Equation 11.3 and thus we have
Lemma 11.1 If the Markov chain satisfies detailed balance for a probability distribution \pi then \pi is a stationary distribution.
But it is easy to see that detailed balance is strictly stronger than stationarity. The property of detailed balance is often also called (time-)reversibility since we have the following result, which states that the probability of any sequence of states is the same as the probability of the time-reversed sequence.
Theorem 11.1 (Time reversibility) Suppose the Markov chain X_n satisfies detailed balance and assume that the initial distribution is the stationary distribution \pi. Then for any sequence of states i_0, \cdots i_n we have
P \left\{ X_{0}= i_0 \,, X_1=i_1\,, \cdots \, , X_n =i_n \right\} \,=\, P \left\{ X_{0}= i_n \,, X_1 = i_{n-1}\,, \cdots \, , X_n =i_0 \right\}
Proof. Using Equation 11.4 repeatedly we find
\begin{aligned}
P \left\{ X_{0}= i_0 \,, X_1=i_1\,, \cdots \, , X_n =i_n \right\}
& \,=\, \pi(i_0) P(i_0, i_1) P(i_1, i_2) \cdots P(i_{n-1}, i_n) \\
& \,=\, P(i_1, i_0) \pi(i_1) P(i_1, i_2) \cdots P(i_{n-1}, i_n) \\
& \,=\, P(i_1, i_0) P(i_2, i_1) \pi(i_2) \cdots P(i_{n-1}, i_n) \\
& \,=\, \cdots \\
& \,=\, P(i_1, i_0) P(i_2, i_1) \cdots \pi(i_{n-1}) P(i_{n-1}, i_n) \\
&\ \,=\, P(i_1, i_0) P(i_2, i_1) \cdots P(i_n, i_{n-1}) \pi(i_n) \\
&\,=\,P \left\{ X_{0}= i_n \,, X_1 = i_{n-1}\,, \cdots \, , X_n =i_0 \right\}
\end{aligned}
\blacksquare
Examples
Random walk on the hypercube \{0,1\}^m. The state space S is
S= \{0,1\}^m \,=\, \left\{ \sigma = ( \sigma_1, \cdots, \sigma_m) \,;\, \sigma_i \in \{0,1\} \right\}
To define the move of the random walk, pick a coordinate j \in \{ 1, \cdots, m\} uniformly at random and flip the j^{th} coordinate, i.e., \sigma_j \to 1- \sigma_j. We have thus
P(\sigma, \sigma') =\left\{ \begin{array}{cl} \frac{1}{m} & {\rm if~}\sigma {\rm~and~}\sigma' {\rm~differ~by~one~coordinate}
\\ 0 &{\rm otherwise}
\end{array}
\right.
Clearly P is symmetric and thus \pi(\sigma) \,=\, 1/ 2^m.
Random walk on a graph G=(E,V): the transition probabilities are p(v, w) = \frac{1}{ {\rm deg}(v)} if w is a neighbor of v. This Markov chain satisfies detailed balance with the (unnormalized) measure \mu(v) = {\rm deg}(v). Indeed we have P(v,w) >0 \iff P(w,v)>0 and thus if P(v,w)>0 we have
\mu(v) P(v, w) \,=\, {\rm deg}(v) \frac{1}{{\rm deg} (v)} = 1 \,=\, {\rm deg}(w) \frac{1}{{\rm deg} (w)}= \mu(w) P(w, v) \,.
This is slightly easier to verify than the stationarity equation \pi P = \pi. After normalization we find \pi(v) = {\rm deg}(v)/2|E|.
For example for the simple random walk on \{0, 1, \cdots, N\} with reflecting boundary conditions we find \pi \,=\, \left( \frac{1}{2N}, \frac{2}{2N}, \cdots, \frac{2}{2N}, \frac{1}{2N} \right).
- Network Markov chain: The previous example can be generalized as follows. For a given graph G=(E,V) let us assign a positive weight c(e)>0 to each (undirected) edge e= \{v,w\}, that is we choose numbers c(v,w)=c(w,v) with c(v,w)=0 iff v and w are not connected by an edge. If the transition probabilities are given by
P(v,w) \,=\, \frac{c(v,w)}{c(v)} \quad \textrm{ with } c(v) \,=\, \sum_{w} c(v, w) \,,
then it is easy to verify that the Markov chain satisfies detailed balance with
\pi(v) \,=\, \frac{c(v)}{c_G} \quad \textrm{ with } c_G \,=\, \sum_{v} c(v) \,.
- Birth-Death Markov chains: Let us consider a Markov chain on the state space S=\{0, \cdots, N\} (N could be infinite) whose transition probabilities have the following tridiagonal structure
\begin{matrix}
0 \\1\\2\\3\\ \cdots
\end{matrix}
\begin{pmatrix}
r_0 & p_0 & 0 & 0& \cdots \\
q_1 & r_1 & p_1 & 0 & \cdots \\
0 & q_2 & r_2 & p_2 & \cdots \\
\vdots & &\ddots&\ddots & \ddots
\end{pmatrix}
This is called a birth and death process since the only possible transitions are to move up or down by one unit or to stay unchanged. Random walks (see Example 5.5 and Example 8.1), discrete queueing models (Example 8.2), and the Ehrenfest urn model (Example 5.6) are special cases of birth and death processes.
Birth and death Markov chains always satisfy detailed balance. Indeed the only nontrivial detailed balance conditions are
\pi(j)P(j,j+1) = \pi(j+1) P(j+1,j) \implies \pi(j) p_j \,=\, \pi(j+1) q_{j+1} \,, \quad \textrm{ for } j=0, \cdots, N-1 \,.
and this can be solved recursively. We obtain
\begin{aligned}
\pi(1) \,&=\, \pi(0) \frac{p_0}{q_1} \\
\pi(2) \,&=\, \pi(1) \frac{p_1}{q_2} \,=\, \pi(0) \frac{p_0 p_1 }{q_1 q_2} \\
& \vdots \\
\pi(N) \,&=\, \pi(0) \frac{p_0 p_1 \cdots p_{N-1}}{q_1 q_2 \cdots q_{N}}
\end{aligned}
and with normalization \pi(j) \,=\, \frac{ \prod_{k=1}^j \frac{p_{k-1}}{q_k}}{ \sum_{l=0}^N \prod_{k=1}^l \frac{p_{k-1}}{q_k}}. If N is infinite we need the sum in the denominator to be finite.
For example the Ehrenfest urn model (Example 5.6) has
p_j \,=\, \frac{N-j}{N} \,, \quad q_j \,=\, \frac{j}{N}
and thus we obtain
\pi(j) \,=\, \pi(0) \frac{ \frac{N}{N} \frac{N-1}{N} \cdots \frac{N-(j-1)}{N}}{ \frac{1}{N} \frac{2}{N} \cdots \frac{j}{N}}
\,=\, \pi(0) { N \choose j } \quad \textrm{ with normalization } \pi(0)^{-1} =\sum_{j=0}^N {N \choose j} \,=\, 2^N, \textrm{ i.e. } \pi(j) = 2^{-N}{N \choose j}.
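The detailed balance recursion is also convenient numerically. The sketch below (ours) computes the stationary distribution of a general birth-death chain from \pi(j+1)=\pi(j)\,p_j/q_{j+1} and checks the binomial answer for the Ehrenfest urn with N=6 (an arbitrary size).

```python
import numpy as np
from math import comb

def birth_death_pi(p, q):
    """Stationary distribution of a birth-death chain on {0,...,N} via the
    detailed balance recursion pi(j+1) = pi(j) * p_j / q_{j+1}.
    p[j] = p_j for j = 0..N-1 (up), q[j] = q_j for j = 0..N (down, q[0] unused)."""
    N = len(p)
    w = np.ones(N + 1)
    for j in range(N):
        w[j + 1] = w[j] * p[j] / q[j + 1]
    return w / w.sum()

# Ehrenfest urn with N = 6: p_j = (N-j)/N, q_j = j/N
N = 6
p = [(N - j) / N for j in range(N)]
q = [j / N for j in range(N + 1)]
print(birth_death_pi(p, q))
print([comb(N, j) / 2**N for j in range(N + 1)])   # binomial(N, 1/2): same values
```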
MCMC
Suppose you are given a certain probability distribution \pi on a set S and your goal is to generate a sample from this distribution. The Markov chain Monte-Carlo (MCMC) method consists in constructing an irreducible Markov chain X_n whose stationary distribution is \pi. Then to sample from \pi one simply runs the Markov chain X_n long enough that it is close to its equilibrium distribution. It turns out that the detailed balance condition is a very useful tool to construct the Markov chain in this manner.
A priori this method might seem an unduly complicated way to sample from \pi. Indeed why not simply simulate from \pi directly using one of the methods of Section Simulation?
To dispel this impression we will consider some concrete examples; for instance we will see that often one wants to generate a uniform distribution on a set S whose cardinality |S| might be very difficult to compute.
A large class of models are so-called “energy models”, which are described by an explicit function f:S \to \mathbb{R} interpreted as the energy (or weight) of the state. Then one is interested in the stationary distribution
\pi(i) = Z^{-1} e^{-f(i)} \quad \textrm { where } Z= \sum_{i}e^{-f(i)} \textrm{ is a normalization constant}
As we will see computing the normalization constant Z can be very difficult and so one cannot directly simulate from \pi!
The measure \pi assigns the largest probability to the states i where f(i) is the smallest, that is the states with the smallest energy.
Often (e.g. in economics) the measure is written with a different sign convention \pi(i) = Z^{-1}e^{f(i)}, which now favors the states with the highest values of f.
Proper q-coloring of a graph
For a graph G=(E,V) a proper q-coloring consists of assigning to each vertex v of the graph one of q colors subject to the constraint that if 2 vertices are linked by an edge they have different colors. The set S' of all proper q-colorings is a subset of S \,=\, \left\{ 1, \cdots, q\right\}^V. We denote the elements of S by \sigma = \{ \sigma(v)\}_{v \in V} with \sigma(v) \in \{1, \cdots, q\}. The uniform distribution on all such proper colorings is \pi(\sigma) = 1/|S'| for all \sigma \in S'. Even for moderately complicated graphs it can be very difficult to compute |S'|.
Consider the following transition rules:
1. Choose a vertex v at random and choose a color a at random.
2. Set \sigma'(v) = a and \sigma'(w)=\sigma(w) for w \not=v.
3. If \sigma' is a proper q-coloring then set X_{n+1}=\sigma'. Otherwise set X_{n+1}=\sigma.
The transition probabilities are then given by
\begin{aligned}
P(\sigma, \sigma') =\left\{
\begin{array}{cl}
\frac{1}{q |V|} & \textrm{ if } \sigma \textrm{ and } \sigma' \textrm{ differ at exactly one vertex} \\
0 & \textrm{ if }\sigma \textrm{ and } \sigma' \textrm{ differ at more than one vertex} \\
1 - \sum_{\sigma' \not = \sigma} P(\sigma, \sigma') & \textrm{ if } \sigma'=\sigma
\end{array}
\right.
\end{aligned}
Note that P(\sigma, \sigma) is not known explicitly either but is also not used to run the algorithm. The cardinality |S'| is also not needed.
In order to check that the uniform distribution is stationary for this Markov chain, note that P(\sigma, \sigma') is a symmetric matrix. If one can change \sigma into \sigma' by changing one color then one can do the reverse transformation too.
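A minimal implementation sketch of this chain (ours) is given below; the 4-cycle with q=3 colors is an arbitrary toy graph and the adjacency-list representation is our own choice.

```python
import random

def coloring_chain_step(sigma, adj, q, rng):
    """One step of the chain described above: recolor a random vertex with a
    random color, keep the move only if the coloring stays proper.
    `adj` is an adjacency list, `sigma` a list of colors (assumed proper)."""
    v = rng.randrange(len(sigma))
    a = rng.randrange(q)
    if all(sigma[w] != a for w in adj[v]):
        sigma = sigma[:]
        sigma[v] = a
    return sigma

# Toy example (ours): a 4-cycle with q = 3 colors
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
rng = random.Random(0)
sigma = [0, 1, 0, 1]                 # a proper 3-coloring to start from
for _ in range(10_000):
    sigma = coloring_chain_step(sigma, adj, 3, rng)
print(sigma)                         # an (approximately uniform) proper coloring
```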
Knapsack problem
This is a classical optimization problem. You own m books and the i^{th} book has weight w_i lb and is worth $ v_i. In your knapsack you can put at most a total of b pounds and you are looking to pack the most valuable knapsack possible.
To formulate the problem mathematically we introduce
\begin{aligned}
w \,&=\, (w_1, \cdots w_m) \in {\bf R}^m &\textrm{ weight vector} \\
v \,&=\, (v_1, \cdots v_m) \in {\bf R}^m &\textrm{ value vector} \\
\sigma\,&=\, (\sigma_1, \cdots \sigma_m) \in \{0,1\}^m &\textrm{ decision vector}
\end{aligned}
where we think that \sigma_i =1 if the i^{th} book is in the knapsack. The state space is
S' \,=\, \left\{ \sigma \in \{0,1\}^{m} \,;\, \sigma \cdot w \le b \right\}
and the optimization problem is
\textrm{Maximize } v \cdot \sigma \textrm{ subject to the constraint } \sigma \in S'
As a first step we generate a random element of S' using the following simple algorithm. If X_n=\sigma then
- Choose j \in \{1, \cdots m\} at random.
- Set \sigma' \,=\, ( \sigma_1, \cdots, 1-\sigma_j, \cdots, \sigma_m).
- If \sigma' \in S', i.e., if \sigma' \cdot w \le b then let X_{n+1}=\sigma'. Otherwise X_{n+1}=\sigma.
In other words, choose a random book. If it is in the sack already, remove it. If it is not in the sack, add it provided you do not exceed the maximum weight. Note that the Markov chain X_n is irreducible, since each state communicates with the state \sigma=(0, \cdots, 0). It is aperiodic except in the uninteresting case where \sum_{i} w_i \le b. Finally the transition probabilities are symmetric and thus the uniform distribution is the unique stationary distribution.
In the knapsack problem we want to maximize the function f(\sigma)=\sigma\cdot v on the state space. One possible algorithm would be to generate a uniform distribution on the state space and then to look for the maximum value of the function. But it is a better idea to sample from a distribution which assigns higher probabilities to the states we are interested in, the ones with a high value of f.
Let S be the state space and let f \,:\, S \to \mathbb{R} be a function. It is convenient to introduce the probability distributions define for \beta>0 by
\pi_\beta(i) \,=\, \frac{e^{\beta f(i)}}{Z_\beta} \, \quad \textrm{ with } \quad Z_\beta =\sum_{j \in S} e^{\beta f(j)} \,.
Clearly \pi_\beta assigns higher weights to the states i with bigger values of f(i). Let us define
S^* \,=\, \left\{ i \in S \,;\, f(i) = f^* \equiv \max_{j \in S} f(j) \right\} \,.
If \beta=0 then \pi_0 is simply the uniform distribution on S. For \beta \to \infty we have
\lim_{\beta \to \infty} \pi_\beta(i) \,=\, \lim_{\beta \to \infty} \frac{ e^{\beta (f(i)- f^*)}} { |S^*| + \sum_{j \in S\setminus S^*} e^{\beta (f(j) - f^* )}} \,=\,
\left\{
\begin{array}{cl} \frac{1}{|S^*|} & {\rm ~if~} i \in S^* \\ 0 & {\rm ~if~} i \notin S^* \\
\end{array}
\right. \,,
i.e., for large \beta \pi_\beta is concentrated on the global maxima of f.
Metropolis algorithm
A fairly general method to generate a distribution \pi on the state space S is given by the Metropolis algorithm. This algorithm assumes that you already know how to generate the uniform distribution on S by using a symmetric transition matrix Q.
Theorem 12.1 (Metropolis algorithm) Let \pi be a probability distribution on S with \pi(i)>0 for all i and let Q be a symmetric transition matrix. Consider the Markov chain with the following transition matrix (the Metropolis algorithm). If X_n=i
- Choose Y \in S according to Q, i.e., P\{ Y=j \,\vert\, X_n=i\} \,=\, Q(i,j)
- Define the acceptance probability \alpha(i,j) \,=\, \min \left \{ 1 \,, \frac{\pi(j)}{\pi(i)} \right\}
- Accept Y with probability \alpha(i,Y) by generating a random number U: if U \le \alpha then set X_{n+1}= Y (i.e., accept the move) and if U > \alpha then X_{n+1}= X_n (i.e., reject the move).
If Q is an irreducible transition probability matrix on S then the Metropolis algorithm defines an irreducible Markov chain on S which satisfies detailed balance with stationary distribution \pi.
Proof. Let P(i,j) be the transition probabilities for the Metropolis Markov chain. Then we have
P(i,j) \,=\, Q(i,j) \alpha(i,j) \,=\, Q(i,j) \min \left\{ 1 \,,\, \frac{\pi(j)}{\pi(i)} \right\} \quad \textrm{ for } i \not= j \,.
Since we assume \pi(i)>0 for all states i, the acceptance probability \alpha never vanishes. Thus P(i,j)>0 whenever Q(i,j)>0, and so P is irreducible if Q is itself irreducible.
In order to check the reversibility we note that
\pi(i) P(i,j) \,=\, Q(i,j) \pi(i) \min \left\{ 1, \frac{\pi(j)}{\pi(i)} \right\} \,=\, Q(i,j) \min \left\{\pi(i)\,,\, \pi(j) \right\}
and the r.h.s is symmetric in i,j and thus \pi(i) P(i,j) =\pi(j) P(j,i). \quad \blacksquare
Note that only the ratios \frac{\pi(j)}{\pi(i)} are needed to run the algorithm; in particular we do not need the normalization constant. This is a very important feature of the Metropolis algorithm.
We could have chosen another acceptance probability \alpha(i,j). By inspection of the proof it is enough to pick \alpha(i,j) such that \alpha(i,j) \le 1 and \pi(i) \alpha(i,j) = \pi(j) \alpha(j,i) is symmetric. Some such examples will be considered in the homework and the choice given in Theorem 12.1 is actually optimal in the sense it gives the highest acceptance probability.
The general case with a non-symmetric proposal matrix Q is called the Metropolis-Hastings algorithm. In this case the acceptance probability is chosen to be
\alpha(i,j) = \min \left\{ 1, \frac{\pi(j)Q(j,i)}{\pi(i)Q(i,j)} \right\}
and it is not difficult to check (see Homework) that the Metropolis-Hastings algorithm yields a reversible Markov chain with stationary distribution \pi.
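The following is a generic Metropolis-Hastings sketch (ours, not code from these notes): pi returns unnormalized target weights, propose draws from Q(i,\cdot), and q_ratio returns Q(j,i)/Q(i,j) (identically 1 for a symmetric proposal). The toy target \pi(i)\propto e^{-|i|} on the integers is only an illustration.

```python
import numpy as np

def metropolis_hastings(pi, propose, q_ratio, x0, n_steps, rng=None):
    """Metropolis-Hastings chain with unnormalized target `pi`, proposal
    sampler `propose(i, rng)` and ratio `q_ratio(i, j) = Q(j,i)/Q(i,j)`."""
    rng = rng or np.random.default_rng(0)
    x, path = x0, [x0]
    for _ in range(n_steps):
        y = propose(x, rng)
        alpha = min(1.0, pi(y) / pi(x) * q_ratio(x, y))   # acceptance probability
        if rng.random() < alpha:
            x = y
        path.append(x)
    return path

# Toy example (ours): pi(i) ~ exp(-|i|) on Z, symmetric +-1 proposal
path = metropolis_hastings(pi=lambda i: np.exp(-abs(i)),
                           propose=lambda i, rng: i + rng.choice([-1, 1]),
                           q_ratio=lambda i, j: 1.0,
                           x0=0, n_steps=50_000)
print(np.mean(np.abs(path)))   # should be close to E|X| under the target
```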
Metropolis algorithm for the knapsack problem
Consider the distribution \pi_\beta(\sigma) \,=\, \frac{e^{\beta v \cdot \sigma}}{ Z_\beta} where the normalization constant Z_\beta = \sum_{\sigma \in S'} e^{\beta v \cdot \sigma} is almost always impossible to compute in practice. However the ratio
\frac{\pi(\sigma')}{\pi(\sigma)} = e^{\beta v \cdot (\sigma'-\sigma)}
does not involve Z_\beta and is easy to compute.
For this distribution we take as proposal the matrix Q used to generate a uniform distribution on the allowed states of the knapsack (see Knapsack problem) and the Metropolis algorithm now reads as follows. If X_n=\sigma then
- Choose j \in \{1, \cdots m\} at random.
- Set \sigma' \,=\, ( \sigma_1, \cdots, 1-\sigma_j, \cdots, \sigma_m).
- If \sigma'\cdot w >b (i.e. if \sigma' \notin S') then X_{n+1}=\sigma.
- If \sigma'\cdot w \le b then let \displaystyle \alpha \,=\, \min \left\{ 1 , \frac{ \pi(\sigma')}{\pi(\sigma)} \right\} \,=\, \min \left\{1, e^{\beta v \cdot (\sigma' - \sigma)} \right\} \,=\, \left\{ \begin{array}{cl} e^{- \beta v_j} & {\rm ~if~} \sigma_j=1 \\ 1 & {\rm ~if~} \sigma_j=0 \end{array} \right.
- Generate a random number U, If U \le \alpha then X_{n+1}=\sigma'. Otherwise X_{n+1}=\sigma.
In short, if you can add a book to your knapsack you always do so, while you remove a book with a probability which is exponentially small in the value of the book.
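Putting the pieces together, here is a hedged sketch (ours) of the Metropolis chain for \pi_\beta on the knapsack; the random weights and values, the budget b and the inverse temperature \beta are arbitrary illustrative choices, and we simply record the best configuration seen along the run.

```python
import numpy as np

def knapsack_metropolis(w, v, b, beta, n_steps, rng=None):
    """Metropolis chain for pi_beta(sigma) ~ exp(beta * v.sigma) on the
    feasible set {sigma : w.sigma <= b}, using the single-flip proposal above."""
    rng = rng or np.random.default_rng(0)
    m = len(w)
    sigma = np.zeros(m, dtype=int)          # the empty knapsack is always feasible
    best, best_val = sigma.copy(), 0.0
    for _ in range(n_steps):
        j = rng.integers(m)
        sigma_new = sigma.copy()
        sigma_new[j] = 1 - sigma_new[j]
        if w @ sigma_new > b:
            continue                        # infeasible proposal: stay put
        # acceptance: 1 when adding a book, exp(-beta*v_j) when removing it
        if rng.random() < min(1.0, np.exp(beta * (v @ (sigma_new - sigma)))):
            sigma = sigma_new
        if v @ sigma > best_val:
            best, best_val = sigma.copy(), float(v @ sigma)
    return best, best_val

rng = np.random.default_rng(3)
w, v = rng.uniform(1, 10, 20), rng.uniform(1, 10, 20)
print(knapsack_metropolis(w, v, b=30.0, beta=2.0, n_steps=20_000))
```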
Glauber algorithm
Another algorithm which is widely used for Markov chain Monte-Carlo is the Glauber algorithm, which appears in the literature under a variety of other names: Gibbs sampler in statistical applications, logit rule in economics and social sciences, heat bath in physics, and undoubtedly others.
The Glauber algorithm is not quite as general as the Metropolis algorithm. Indeed we assume that the state space S has the following structure
S \subset \Omega^V
where both \Omega and V are finite sets. For example S \subset \{0,1\}^m in the case of the knapsack problem or S \subset \{ 1, \cdots, q\}^V for the case of the proper q-coloring of a graph. We denote by
\sigma \,=\, \{ \sigma(v) \}_{v \in V} \,, \quad \sigma(v) \in \Omega \,.
the elements of S.
It is useful to introduce the notation
\sigma_{-v} \,=\, \{ \sigma(w) \}_{w \in V, w\not =v}
and we write
\sigma \,=\, ( \sigma_{-v}, \sigma(v) ) \,.
to single out the v entry of the vector \sigma.
Theorem 12.2 (Glauber algorithm) Let \pi be a probability distribution on S\subset \Omega^V and extend \pi to \Omega^V by setting \pi(\sigma)=0 if \sigma \in \Omega^V \setminus S. If X_n \,=\, \sigma then
- Choose v \in V at random.
- Replace \sigma(v) by a new value a \in \Omega (provided ( \sigma_{-v}, a) \in S) with probability
\frac{ \pi( \sigma_{-v}, a) } {\sum_{ b \in \Omega} \pi( \sigma_{-v}, b)} \,.
The Glauber algorithm defines a Markov chain on S which satisfies detailed balance with stationary distribution \pi.
The irreducibility of the algorithm is not guaranteed a-priori and needs to be checked on a case-by-case basis.
Proof. The transition probabilities are given by
P(\sigma, \sigma') =
\left\{ \begin{array}{cl}
\frac{1}{ |V|} \frac{ \pi( \sigma_{-v}, \sigma'(v)) } {\sum_{ b \in \Omega} \pi( \sigma_{-v}, b)} & \textrm{ if } \sigma' \not= \sigma \textrm{ and } \sigma_{-v} = \sigma'_{-v} {\rm~for~some~}v \\
0 & \textrm{ if } \sigma_{-v} \not= \sigma'_{-v} {\rm~for~all~}v \\
1 - \sum_{\sigma' \not= \sigma} P(\sigma, \sigma') & \textrm{ if } \sigma=\sigma'
\end{array} \right.
To check detailed balance we note that if P(\sigma, \sigma')\not=0
\pi(\sigma) P(\sigma, \sigma') \,=\, \frac{1}{|V|}\frac{\pi(\sigma) \pi(\sigma')} {\sum_{ b \in \Omega} \pi( \sigma_{-v}, b)} \,,
and this is symmetric in \sigma and \sigma'. \quad \blacksquare.
Ising model on a graph
Let G=(E,V) be a graph and let S= \{-1,1\}^V. That is, to each vertex we assign the value \pm 1; you can think of a magnet at each vertex pointing either upward (+1) or downward (-1). To each \sigma \in S we assign an “energy” H(\sigma) given by
H(\sigma) \,=\, - \sum_{ e=(v,w) \in E} \sigma(v) \sigma(w) \,.
The energy H(\sigma) is minimal if \sigma(v) \sigma(w) =1 for every edge, i.e., if the magnets at v and w are aligned. Let us consider the probability distribution
\pi_\beta (\sigma) \,=\, \frac{e^{-\beta H(\sigma)} }{Z_\beta} \,, \quad Z_\beta \,=\, \sum_{\sigma} e^{-\beta H(\sigma)} \,.
The distribution \pi_\beta is concentrated around the minima of H(\sigma). To describe the Glauber dynamics note that
H ( \sigma_{-v}, 1) - H( \sigma_{-v}, -1) \,=\, -2 \sum_{ w\,;\, w \sim v} \sigma(w)
and this can be computed simply by looking at the vertices connected to v and not at the whole graph. So the transition probabilities for the Glauber algorithm are given by picking a vertex at random and then updating with probabilities
\frac{ \pi( \sigma_{-v}, \pm 1)} { \pi( \sigma_{-v}, 1) + \pi( \sigma_{-v}, -1) } \,=\, \frac{1}{ 1 + e^{\pm \beta \left[ H ( \sigma_{-v}, 1) -
H( \sigma_{-v}, -1)\right]}} \,=\, \frac{1}{1 + e^{\mp 2 \beta \sum_{ w\,;\, w \sim v} \sigma(w)}} \,.
By comparison, for the Metropolis algorithm we pick a vertex at random, propose to switch \sigma(v) to -\sigma(v), and accept the move with probability
\min\left\{ 1, \frac{ \pi( \sigma_{-v}, - \sigma(v))}{\pi( \sigma_{-v}, \sigma(v))} \right\} \,=\, \min \left\{ 1, e^{ -2\beta \sum_{ w\,;\, w \sim v}
\sigma(w)\sigma(v)} \right\} \,.
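A minimal Glauber (heat-bath) sampler for the Ising model, following the update probabilities above, might look as follows; the cycle graph, the value of \beta and the run length are our own illustrative choices.

```python
import numpy as np

def glauber_ising_step(sigma, adj, beta, rng):
    """One Glauber update: pick a vertex at random and resample its spin from
    the conditional law P(sigma(v)=+1 | rest) = 1/(1 + exp(-2*beta*S)),
    where S is the sum of the neighboring spins."""
    v = rng.integers(len(sigma))
    S = sum(sigma[w] for w in adj[v])
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * S))
    sigma[v] = 1 if rng.random() < p_plus else -1
    return sigma

# Toy example (ours): Ising model on a cycle of length 50
n, beta = 50, 0.8
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
rng = np.random.default_rng(0)
sigma = rng.choice([-1, 1], size=n)
for _ in range(100_000):
    sigma = glauber_ising_step(sigma, adj, beta, rng)
print(sigma.mean())    # magnetization of one (approximate) sample from pi_beta
```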
Simulated annealing
- Consider again the problem of finding the minimum of a function f(j), that is
f^*= \min\{f(j), j \in S\}
and we denote by S^*\subset S the set of global minima. You should think of f as a complicated (and non-convex) function with complicated level sets and various “local” minima.
- To perform this task we sample from a distribution of the form
\pi_T(j) = Z_T^{-1} e^{-\frac{f(j)}{T}}
which concentrates on the minima of f(j), and the more so as T \to 0.
The idea of simulated annealing comes from physics. In annealing, to obtain a low energy state of a solid (typically a crystal) one first heats it up to reach a liquid state and then, slowly, decreases the temperature to let the particles arrange themselves.
For a Markov chain the idea is to pick a temperature schedule
T_1 > T_2 > T_3 > \cdots \quad \textrm{ with } \lim_{k \to \infty} T_k =0
with T_1 sufficiently large.
The following algorithm is a Markov chain with nonstationary transition probabilities.
Simulated annealing algorithm:
+ Initialise the Markov chain X_0 and the temperature T_1.
+ For each k run N_k steps of the Metropolis or Gibbs sampler with invariant distribution \pi_{T_k}.
+ Update the temperature to T_{k+1} starting with the final configuration.
A nice result about the Metropolis sampler can be found for example in Hajek (1988) (a more precise version is given there).
Theorem 12.3 (Convergence of simulated annealing) For the simulated annealing of the Metropolis algorithm (N_k=1) there exists a constant d^* such that we have
\lim_{n \to \infty} P\{X_n \in S^*\} = 1
if and only if
\sum_{k=1}^\infty e^{-\frac{d^*}{T_k}} =\infty
The constant d^* measures the depths of the local minima of f, where locality is measured in terms of the proposal matrix Q.
The theorem tells us that if we decrease the temperature very, very slowly, e.g. like
T_k = \frac{c}{\log(k)}
with c> d^*, then X_n ends up in the set of global minima with probability approaching 1.
This is an extremely slow cooling schedule, which makes it not very practical; in practice other schedules such as T_{k+1} = 0.99\, T_k are used.
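A bare-bones simulated annealing sketch (ours), with one Metropolis step per temperature (N_k=1) and a geometric cooling schedule of the practical kind just mentioned; the objective function and the proposal are toy choices.

```python
import numpy as np

def simulated_annealing(f, propose, x0, temps, rng=None):
    """Simulated annealing sketch: one Metropolis step (N_k = 1) per
    temperature in the decreasing schedule `temps`, targeting exp(-f/T_k)."""
    rng = rng or np.random.default_rng(0)
    x, best = x0, x0
    for T in temps:
        y = propose(x, rng)
        delta = f(y) - f(x)
        # accept downhill moves always, uphill moves with prob exp(-delta/T)
        if delta <= 0 or rng.random() < np.exp(-delta / T):
            x = y
        if f(x) < f(best):
            best = x
    return best

# Toy example (ours): minimize a wiggly function on the integers 0,...,100
f = lambda i: (i - 37)**2 / 100.0 + 3*np.sin(i)
propose = lambda i, rng: min(100, max(0, i + rng.choice([-1, 1])))
temps = 1.0 * 0.999 ** np.arange(20_000)      # geometric cooling T_{k+1} = 0.999 T_k
print(simulated_annealing(f, propose, x0=0, temps=temps))
```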
Parallel tempering
- The idea of parallel tempering is not to use a temperature schedule but rather to use several copies of the system running at different temperatures. The copies at high temperature explore the state space more efficiently, and there is a switching mechanism which exchanges the different copies so that the lower temperature copies can take advantage of the exploration done at high temperature.
The state space is the product of K copies of S
S^{(K)} = \underbrace{S \times \cdots \times S}_{K \textrm{ times}}
and we denote a state by the vector \mathbf{i} = (i_1, \cdots, i_K).
One picks a set of temperatures T_1 < T_2 < \cdots < T_K and a probability distribution
\pi^{(K)} ({\bf i}) =\prod_{l=1}^K \pi_{T_l}(i_l)= \prod_{l=1}^K Z_{l}^{-1} e^{- \frac{f(i_l)}{T_l}}
which is the product of the distribution at different temperatures.
The parallel tempering algorithm consists of two kinds of moves. The parallel move updates each component of \mathbf{X}_n independently with the Metropolis algorithm at the corresponding temperature, and the swapping move exchanges a pair of components of the state vector \mathbf{X}_n in such a way as not to disturb the invariant measure. The component at the lowest temperature can be used to find the desired minimum.
Parallel tempering algorithm:
+ Suppose the state is \mathbf{X}_n=\mathbf{i}; pick a random number U.
+ If U < \alpha then we do the parallel step and update each component of \mathbf{X}_n according to a Metropolis move at the corresponding temperature T_l.
+ If U \ge \alpha we do the swapping step. We randomly choose a neighboring pair l,l+1 and propose to swap the components X_{n,l} and X_{n,l+1}. We accept the swap with probability
\min\left\{
1, \frac{ \pi_{T_l}( i_{l+1}) \pi_{T_{l+1}}( i_l ) }{ \pi_{T_l}(i_l) \pi_{T_{l+1}}(i_{l+1})}
\right\}
The parallel moves clearly satisfy detailed balance since each component does. As for a swap move which swaps the components i_l and i_{l+1} in the state vector, we also have detailed balance since
\begin{aligned}
& \pi_{T_l}(i_l) \pi_{T_{l+1}}(i_{l+1}) (1-\alpha) \frac{1}{K-1}
\min\left\{
1, \frac{ \pi_{T_l}( i_{l+1}) \pi_{T_{l+1}}( i_l ) }{ \pi_{T_l}(i_l) \pi_{T_{l+1}}(i_{l+1})}
\right\}
\\
& = (1-\alpha) \frac{1}{K-1}
\min \left\{
\pi_{T_l}(i_l) \pi_{T_{l+1}}(i_{l+1}) , \pi_{T_l}( i_{l+1}) \pi_{T_{l+1}}( i_l )
\right\}
\end{aligned}
which is symmetric in i_l,i_{l+1}.
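For completeness, here is a hedged sketch (ours) of the algorithm with target \pi_T(i)\propto e^{-f(i)/T}; the parameter swap_prob plays the role of 1-\alpha above, and the temperatures, objective and proposal are toy choices.

```python
import numpy as np

def parallel_tempering(f, propose, temps, x0, n_steps, swap_prob=0.2, rng=None):
    """Parallel tempering sketch: K Metropolis chains targeting exp(-f/T_l),
    with occasional swaps of neighboring components accepted as above."""
    rng = rng or np.random.default_rng(0)
    K = len(temps)
    x = [x0] * K
    for _ in range(n_steps):
        if rng.random() >= swap_prob:          # parallel move: one Metropolis step per copy
            for l in range(K):
                y = propose(x[l], rng)
                delta = f(y) - f(x[l])
                if delta <= 0 or rng.random() < np.exp(-delta / temps[l]):
                    x[l] = y
        else:                                  # swap move between components l and l+1
            l = rng.integers(K - 1)
            log_ratio = (f(x[l]) - f(x[l + 1])) * (1.0/temps[l] - 1.0/temps[l + 1])
            if log_ratio >= 0 or rng.random() < np.exp(log_ratio):
                x[l], x[l + 1] = x[l + 1], x[l]
    return x[0]                                # the lowest-temperature component

f = lambda i: (i - 37)**2 / 100.0 + 3*np.sin(i)
propose = lambda i, rng: min(100, max(0, i + rng.choice([-1, 1])))
print(parallel_tempering(f, propose, temps=[0.1, 0.5, 2.0], x0=0, n_steps=50_000))
```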
Total variation norm
Given two probability measures \mu and \nu on S the total variation distance between \mu and \nu is given by
\|\mu - \nu\|_{\textrm{TV}}= \sup_{A \subset S}|\mu(A) - \nu(A)|
that is the largest distance between the measures of sets.
The supremum over all sets is not convenient to compute but we have the following formula.
Theorem 13.1 We have the formula
\begin{aligned}
\|\mu - \nu\|_{\textrm{TV}} &= \sum_{i: \mu(i) \ge \nu(i)} (\mu(i) - \nu(i))
= \frac{1}{2} \sum_{i} | \mu(i) - \nu(i)|
\end{aligned}
\tag{13.1}
Proof. First note that the second equality in Equation 13.1 follows from the first. Indeed, if the first equality holds, then by interchanging \mu and \nu we also have
\|\mu - \nu\|_{\textrm{TV}} =
\sum_{i: \mu(i) \ge \nu(i)} |\mu(i) - \nu(i)| =
\sum_{i: \nu(i) \ge \mu(i)} |\nu(i) - \mu(i)|
and adding the two expressions gives 2\|\mu - \nu\|_{\textrm{TV}} = \sum_{i} | \mu(i) - \nu(i)|, which proves the second equality.
To prove the first equality in Equation 13.1 we consider the set B=\{i:\mu(i) \ge \nu(i)\}. For any event A we have
\begin{aligned}
\mu(A) - \nu(A) = \sum_{i \in A } \mu(i)-\nu(i) \le \sum_{i \in A \cap B} \mu(i)-\nu(i) \le \sum_{i \in B} \mu(i)-\nu(i) =\mu(B)-\nu(B)
\end{aligned}
By interchanging the role of \mu and \nu we find
\begin{aligned}
\nu(A) - \mu(A) &\le \nu(B^c)-\mu(B^c) = \mu(B) -\nu(B)
\end{aligned}
and thus for any set A we have |\mu(A)-\nu(A)|\le \mu(B) - \nu(B). Since the choice A=B attains this bound, the first equality follows. \quad\blacksquare
The total variation is also intimately related to the notion of coupling between probability measures. A coupling between the probability measures \mu and \nu is a probability measure q(i,j) on the product space S \times S such that
\sum_{j} q(i,j) = \mu(i) \quad \textrm { and } \quad \sum_{i} q(i,j) = \nu(j)
i.e. the marginals of q are \mu and \nu.
Couplings are nicely expressed in terms of random variables. We can think of q(i,j) as the (joint) pdf of the rv (X,Y) where X has pdf \mu and Y has pdf \nu.
There always exists a coupling since we can always choose X and Y independent, i.e. q(i,j)=\mu(i)\nu(j).
At the opposite extreme, if \mu=\nu then we can pick X=Y as a coupling, i.e. q(i,i)=\mu(i)=\nu(i) and q(i,j)=0 if i\not=j.
Total variation and coupling
Theorem 13.2 (Coupling representation of total variation) We have
\|\mu -\nu\|_{\textrm{TV}}= \inf \left\{ P\{X \not =Y\}\,; (X,Y) \textrm{ coupling of } \mu \textrm { and } \nu \right\}
\tag{13.2}
Proof. We first prove an inequality:
\begin{aligned}
\mu(A)-\nu(A) &= P\{X \in A\} - P\{Y \in A\} \le P\{X \in A\}- P\{X \in A, Y \in A\}
\\& = P\{X \in A, Y \notin A\} \le P\{ X \not=Y\}
\end{aligned}
and thus \|\mu-\nu\|_{\textrm{TV}} \le \inf P \{ X \not=Y\}.
To show the equality we construct an optimal coupling.
Recall from Theorem 13.1 that \displaystyle \|\mu-\nu\|_{\textrm{TV}}=\sum_{\mu(i)\ge \nu(i)}\mu(i)-\nu(i) = \sum_{\nu(i)\ge \mu(i)}\nu(i)-\mu(i) (see regions A and B in Figure 13.1) and we set
p= \sum_{i} \mu(i) \wedge \nu(i) = 1 - \|\mu-\nu\|_{\textrm{TV}}
(see region C in Figure 13.1).
Consider now the coupling defined as follows. Pick a random number U
- If U \le p then let Z be the RV with pdf
\gamma_C(i)=\frac{ \mu(i) \wedge \nu(i) }{p}
and set X=Y=Z.
- If U>p then let X be the random variable with pdf \gamma_A(i) and Y be the random variable with pdf \gamma_B(i) where
\begin{aligned}
\gamma_A(i) =\left\{
\begin{array}{cl}
\frac{\mu(i)-\nu(i)}{\|\mu-\nu\|_{\textrm{TV}}} & \mu(i)\ge \nu(i) \\
0 & \textrm{otherwise}
\end{array}
\right.
\quad
\gamma_B(i) =\left\{
\begin{array}{cl}
\frac{\nu(i)-\mu(i)}{\|\mu-\nu\|_{\textrm{TV}}} & \nu(i)\ge \mu(i) \\
0 & \textrm{otherwise}
\end{array}
\right.
\end{aligned}
Since p \gamma_C + (1-p) \gamma_A = \mu and p \gamma_C + (1-p) \gamma_B = \nu this defines a coupling and we have P\{ X\not=Y\}=1-p =\|\mu-\nu\|_{\textrm{TV}}. \quad \blacksquare.
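The proof is constructive and translates directly into a sampler for the optimal (maximal) coupling. A sketch in Python, for two pmfs given as arrays; the helper name `sample_maximal_coupling` and the distributions in the check are illustrative.

```python
import numpy as np

def sample_maximal_coupling(mu, nu, rng=None):
    """Draw (X, Y) with X ~ mu, Y ~ nu and P{X != Y} = ||mu - nu||_TV."""
    rng = np.random.default_rng() if rng is None else rng
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    common = np.minimum(mu, nu)                 # unnormalized gamma_C
    p = common.sum()                            # p = 1 - ||mu - nu||_TV
    if rng.random() <= p:
        z = rng.choice(len(mu), p=common / p)   # X = Y = Z with pdf gamma_C
        return z, z
    gamma_A = np.maximum(mu - nu, 0.0)          # supported where mu >= nu
    gamma_B = np.maximum(nu - mu, 0.0)          # supported where nu >= mu
    x = rng.choice(len(mu), p=gamma_A / gamma_A.sum())
    y = rng.choice(len(nu), p=gamma_B / gamma_B.sum())
    return x, y

# the fraction of uncoupled pairs estimates the total variation distance
mu, nu = np.array([0.5, 0.3, 0.2]), np.array([0.2, 0.3, 0.5])
pairs = [sample_maximal_coupling(mu, nu) for _ in range(20000)]
print(np.mean([x != y for x, y in pairs]), 0.5 * np.abs(mu - nu).sum())
```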
Coupling of Markov chains
A coupling of a Markov chain is a stochastic process (X_n,Y_n) with state space S\times S such that both X_n and Y_n are Markov chains with transition matrix P but with possibly different initial conditions.
A Markovian coupling is a coupling such that the joint (X_n,Y_n) is itself a Markov chain with some transition matrix Q((i,j), (k,l)) which must then satisfy
\sum_{k} Q((i,j), (k,l)) = P(j,l) \quad \textrm{ and } \sum_{l} Q((i,j), (k,l)) = P(i,k) \,.
Recall that in Theorem 6.3, where we proved the convergence of aperiodic irreducible Markov chains to their stationary distribution, we used an independent coupling. We will expand upon this idea by considering other, more interesting couplings.
The coupling time is defined by
\sigma= \inf \{ n \ge 0 \,:\, X_n=Y_n \}
that is, it is the first time that the two chains are in the same state.
After the coupling time \sigma, X_n and Y_n have the same distribution (by the strong Markov property), so we can always modify a coupling so that, after time \sigma, X_n and Y_n move together, i.e. X_n=Y_n for all n \ge \sigma. We will always assume this to be true in what follows.
Speed of convergence via coupling
Theorem 13.3 Suppose (X_n,Y_n) is a coupling of a Markov chain such that X_0=i and Y_0=j and \sigma is the coupling time. Then we have
\|P^n(i,\cdot) - P^n(j,\cdot)\|_{ \textrm{TV} } \le P \left\{ \sigma>n| X_0=i, Y_0=j\right\}
Proof. We have P\{X_n=l|X_0=i\}=P^n(i,l) and P\{Y_n=l|Y_0=j\}=P^n(j,l) and therefore (X_n, Y_n) is a coupling of the probability distributions P^n(i,\cdot) and P^n(j,\cdot). So by Theorem 13.2 we have
\|P^n(i,\cdot) - P^n(j,\cdot)\|_{\textrm{TV}} \le P \left\{ X_n \not= Y_n | X_0=i,Y_0=j\right\} = P \left\{ \sigma > n | X_0=i,Y_0=j\right\}\,.
\blacksquare
We can use this result to bound the distance to the stationary measure.
Theorem 13.4 We have
\sup_{i \in S } \| P^n(i,\cdot) - \pi\|_{\textrm{TV}} \le \sup_{i,j \in S } \|P^n(i,\cdot) - P^n(j,\cdot)\|_{\textrm{TV}}
Proof. Using the stationarity and the triangle inequality we have
\begin{aligned}
\| P^n(i,\cdot) - \pi\|_{\textrm{TV}}
&= \sup_{A} |P^n(i,A) - \pi(A)| \\
& = \sup_{A}| \sum_{j} \pi(j) (P^n(i,A)-P^n(j,A))| \\
%& \le \sup_{A} \sum_{j} \pi(j) |P^n(i,A)-P^n(j,A)| \\
& \le \sum_{j} \pi(j) \sup_{A} |P^n(i,A)-P^n(j,A)| \\
& = \sum_{j} \pi(j) \|P^n(i,\cdot) - P^n(j,\cdot)\|_{TV} \\
& \le \sup_{i,j\in S} \|P^n(i,\cdot) - P^n(j,\cdot)\|_{TV} \quad \blacksquare
\end{aligned}
- We set d(n) = \sup_{i} \|P^n(i,\cdot) - \pi \|_{\textrm{TV}} which is the maximal distance to stationarity starting from an arbitrary initial state. (It is not hard to see that \|\mu P^n - \pi \|_{\textrm{TV}} \le d(n) for an arbitrary initial distribution \mu as well.)
We define then the mixing time t_{\textrm{mix}}(\varepsilon) of a Markov chain to be
t_{\textrm{mix}}(\varepsilon) = \min\{ n \,:\, d(n) \le \varepsilon \}\,.
That is, if n \ge t_{\textrm{mix}}(\varepsilon) then \mu P^n is within \varepsilon of the stationary distribution in total variation.
Mixing time for the success run chain
We start with a countable state space example, the success run chain, which is very easy to analyze.
Parenthetically, it should be noted that the supremum over i in d(n) is often not well suited for countable state spaces: the number of steps it takes to get close to the stationary distribution may depend on where you start.
We consider a special case of Example 8.4 with constant success probability:
P(n,0)=1-p\,, \quad P(n,n+1)= p\,, \quad n=0,1,2,\cdots
Suppose X_0=i and Y_0=j (with i\not=j). We couple the two chains by moving them together:
Pick a random number U. If U < 1-p then set X_1=Y_1=0, and the coupling time is \sigma=1.
If U \ge 1-p then move X_1=i+1, Y_1=j+1.
Clearly we have
P( \sigma >n| X_0=i, Y_0=j) = p^n
and thus we find for the mixing time
d(n) \le p^n \quad \textrm{ and } \quad p^n \le \varepsilon \iff n \ge \frac{\ln \varepsilon}{\ln p}\,, \quad \textrm{ so } \quad t_{\textrm{mix}}(\varepsilon) \le \frac{\ln \varepsilon}{\ln p}\,.
Mixing time for the random walk on the hypercube
Recall that for the random walk on the hypercube (see Example 5.4) a state is a string of d bits \sigma=(\sigma_1, \cdots, \sigma_d) with \sigma_j\in \{0,1\}.
The Markov chain is periodic with period 2, so to make it aperiodic we consider its “lazy” version in which we do not move with probability 1/2, that is we consider instead the transition matrix \frac{P+I}{2}, which makes it aperiodic. The Markov chain then moves as follows: we pick a coordinate \sigma_j at random and replace it by a random bit 0 or 1.
To couple the Markov chains we simply move them together: if X_n=\sigma and Y_n=\sigma' we pick a coordinate j \in \{1,\cdots, d\} at random and then replace both \sigma_j and \sigma'_j by the same random bit. This is a coupling, and after a move the chosen coordinates coincide.
Under this coupling X_n and Y_n get closer to each other whenever we select a j such that \sigma_j \not= \sigma'_j, and the chains have coupled once all the coordinates have been selected. The distribution of the coupling time is thus exactly the same as the time \tau needed to collect all d toys in the coupon collector problem.
We now get a bound on the tail as follows. We denote by A_i the event that coordinate i has not been selected after n steps. Then, using the inequality (1-x)\le e^{-x},
\begin{aligned}
P\{\sigma > n\} = P\left(\bigcup_{i=1}^d A_i\right) \le \sum_{i=1}^d P(A_i) =
\sum_{i=1}^d \left(1- \frac{1}{d}\right)^n \le d e^{-\frac{n}{d}}
\end{aligned}
So if we pick n = d\ln(d) + c d we find
P\{\sigma > d\ln(d) + c d\} \le d e^{-\frac{d \ln(d) + cd}{d}}= e^{-c} \implies t_{\textrm{mix}}(\varepsilon) \le d\ln(d) + \ln(\varepsilon^{-1})d\,.
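As a sanity check on this coupon-collector bound, here is a small numpy sketch (the dimension d, the constant c, and the number of trials are arbitrary illustrative choices); it runs the coordinate-matching coupling from two opposite corners and compares the empirical tail probability with e^{-c}.

```python
import numpy as np

def coupling_time_hypercube(d, rng):
    """Coupling time of the lazy walk: refresh the same coordinate with the same bit in both chains."""
    x = rng.integers(0, 2, size=d)   # X_0: an arbitrary corner
    y = 1 - x                        # Y_0: the opposite corner (worst case)
    n = 0
    while not np.array_equal(x, y):
        j = rng.integers(d)
        x[j] = y[j] = rng.integers(0, 2)
        n += 1
    return n

rng = np.random.default_rng(0)
d, c = 20, 1.0
threshold = d * np.log(d) + c * d
times = np.array([coupling_time_hypercube(d, rng) for _ in range(2000)])
print((times > threshold).mean(), "vs the bound", np.exp(-c))
```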
Definition of the Poisson process
Counting processes N_t are continuous time stochastic processes which count the number of events occurring up to time t:
N_t = \textrm{total number of events occuring up to time } t
e.g. the number of accidents reported to an insurance company, the number of earthquakes in California, and so on.
The Poisson process is the simplest counting process, intuitively characterized by the following properties:
- The numbers of events occurring in disjoint time intervals are independent.
- The rate at which events occur is constant over time.
- The events occur one at a time.
More formally, a Poisson process N_t with rate parameter \lambda is a continuous time stochastic process such that
Independent increments: Given times s_1\le t_1 \le s_2 \le t_2 \cdots \le s_n \le t_n the random variables N_{t_i}-N_{s_i} (that is the number of events occurring in the disjoint time intervals [s_i,t_i]) are independent.
The rate at which events occur is denoted by \lambda>0 and for any (small) time interval [t, t + \Delta t] we have
\begin{aligned}
P \{ N_{t + \Delta t } = N_t \} &= 1- \lambda \Delta t +o(\Delta t) & \\
P\{N_{t + \Delta t}= N_t +1\} &= \lambda \Delta t +o(\Delta t) &
\textrm{ with } \lim_{\Delta t \to 0}\frac{o(\Delta t)}{\Delta t}=0 \\
P\{N_{t + \Delta t}\ge N_t +2\} & = o(\Delta t) &
\end{aligned}
\tag{14.1}
Distribution of the Poisson process
Theorem 14.1 (Distribution of the Poisson process) The Poisson process N_t with N_0=0 has the distribution
P\{N_t=k\} = e^{-\lambda t}\frac{(\lambda t)^k}{k!}
i.e. N_t has a Poisson distribution with parameter \lambda t. Moreover, for any s \ge 0, M_t=N_{t+s}-N_{s} is again a Poisson process with rate \lambda.
Proof (version 1 using Poisson limit). Pick a large number n and divide the interval [0,t] into n intervals of size \frac{t}{n}. Write
N_t = \sum_{j=1}^n\left( N_{j \frac{t}{n}}- N_{(j-1) \frac{t}{n}} \right)
as a sum of n independent random variables. If n is large the probability that any of these random variables is at least 2 is small. Indeed, by a union bound, we have
\begin{aligned}
P\left\{ N_{j \frac{t}{n}}- N_{(j-1) \frac{t}{n}} \ge 2 \textrm { for some } j \right\} &\le \sum_{j=1}^n P\left\{ N_{j \frac{t}{n}}- N_{(j-1) \frac{t}{n}} \ge 2 \right\}
\le n P \left\{ N_{\frac{t}{n}} \ge 2\right\} = t \frac{o( \frac{t}{n})}{\frac{t}{n}}
\end{aligned}
which goes to 0 as n \to \infty.
Therefore N_t is, approximately, a binomial random variable with n trials and success probability \frac{\lambda t}{n}:
P(N_t=k) \approx {n \choose k} \left(\frac{\lambda t}{n} \right)^k
\left(1-\frac{\lambda t}{n} \right)^{n-k}
and as n \to \infty this converges to a Poisson distribution with parameter \lambda t. \quad \blacksquare.
Proof (version 2 using ODEs). Let us derive a system of ODEs for the family P_t(k)=P\left\{ N_t=k \right\}. We have
\frac{d}{dt}P_t(k) = \lim_{\Delta t \to 0} \frac{P\left\{ N_{t+\Delta t}=k \right\} - P\left\{ N_t=k \right\} }{\Delta t}
Conditioning we find
\begin{aligned}
P\left\{ N_{t+\Delta t}=k \right\}=&
P\left\{ N_{t+\Delta t}=k | N_t=k \right\} P\left\{ N_{t}=k \right\} \\
& + P\left\{ N_{t+\Delta t}=k | N_t=k-1 \right\}P\left\{ N_{t}={k-1} \right\} \\
& + P\left\{ N_{t+\Delta t}=k | N_t\le k-2\right\}P\left\{N_t \le k-2\right\} \\
=& P_t(k) (1 - \lambda \Delta t) + P_t(k-1) \lambda \Delta t + o(\Delta t)
\end{aligned}
and this gives the system of equations
\begin{aligned}
\frac{d}{dt}P_t(0)&= - \lambda P_t(0) & \quad P_0(0)=1 \\
\frac{d}{dt}P_t(k)&= \lambda P_t(k-1) - \lambda P_t(k) &\quad P_0(k)=0
\end{aligned}
We find P_t(0)= e^{-\lambda t} for k=0. We use an integrating factor and set f_t(k)=e^{\lambda t} P_t(k). We then have f_t(0)=1 and for k >0
\frac{d}{dt}f_t(k)=\lambda f_t(k-1), \quad f_0(k)=0
which we can solve iteratively to find
f_t(k) = \frac{(\lambda t)^k}{k!}
and thus N_t has distribution
P_t(k)= e^{-\lambda t}\frac{(\lambda t)^k}{k!}\,.
\blacksquare.
Poisson process and exponential random variables
- Poisson processes and exponential (and gamma) random variables are intimately related. Given a Poisson process we consider the interarrival times T_1, T_2, \cdots: T_1 is the time of the occurrence of the first event, T_2 is the time elapsed between the first and second events, and so on.
Theorem 14.2 If N_t is a Poisson process with parameter \lambda the interarrival times are independent exponential random variables with parameter \lambda.
Proof. If T_1 >t it means no event has occurred up to time t and so N_t=0. Therefore
P\left\{ T_1 >t \right\} = P\left\{ N_t =0\right\} = e^{-\lambda t}
and thus T_1 has an exponential distribution. For T_2 we condition on T_1
P\left\{ T_2 >t \right\} = \int P(T_2>t |T_1 =s) f_{T_1}(s) ds
and, using the independence of the increments,
P\left\{ T_2>t |T_1 =s\right\} = P\left\{ 0 \textrm{ events in } (s, s+t] \,\big|\, T_1=s \right\} = P\left\{ 0 \textrm{ events in } (s, s+t] \right\} = e^{-\lambda t}
from which we conclude that T_2 has exponential distribution and is independent of T_1. This argument can be repeated for T_3 by conditioning on the time of the second event, T_1+T_2, and so on. \quad \blacksquare
Another set of closely related quantities are the arrival times of the n^{th} event S_1, S_2, \cdots which are related to the interarrival times by
S_1=T_1, \quad S_2=T_1+T_2, \quad S_3=T_1+T_2+T_3, \cdots
By Theorem 14.2, S_n is the sum of n independent exponential RVs and thus S_n is a Gamma RV with parameters (n,\lambda) and density
f_{S_n}(t) = \lambda e^{-\lambda t}\frac{(\lambda t)^{n-1}}{(n-1)!}\, \quad \textrm{ for }t \ge 0\,.
We can actually prove this fact using the Poisson process by noting that, by definition,
N_t \ge n \iff S_n \le t \,,
that is, n or more events have occurred by time t if and only if the n^{th} event has occurred prior to or at time t.
So the CDF of S_n is
F_{S_n}(t)=P\{ S_n \le t\} = P \{ N_t \ge n\} = \sum_{j=n}^\infty e^{-\lambda t} \frac{(\lambda t)^{j}}{j!}
and upon differentiating we find
f_{S_n}(t)= - \lambda e^{-\lambda t} \sum_{j=n}^\infty \frac{(\lambda t)^{j}}{j!} + \lambda e^{-\lambda t}\sum_{j=n}^\infty \frac{(\lambda t)^{j-1}}{(j-1)!} = \lambda e^{-\lambda t} \frac{(\lambda t)^{n-1}}{(n-1)!}. \quad \blacksquare
Conditioned on \{N_t=n\}, the arrival times S_1, \cdots, S_n are distributed like the order statistics of n independent uniform random variables on [0,t]. Indeed, by the independence of the interarrival times proved in Theorem 14.2, the conditional density of the arrival times given \{N_t=n\} is
\begin{aligned}
f(s_1, s_2, \cdots, s_n|n) &=
\frac{f(s_1, s_2, \cdots, s_n, n)}{P\{N_t=n\}} \\
&= \frac{\lambda e^{-\lambda s_1} \lambda e^{-\lambda(s_2-s_1)} \cdots \lambda e^{-\lambda(s_n-s_{n-1})} e^{-\lambda(t-s_n)} }{e^{-\lambda t} \frac{(\lambda t)^n}{n!} } = \frac{n!}{t^n}
\end{aligned}
which is the joint density of the order statistics of n uniform random variables on [0,t]. \blacksquare
- Recall that if X_1, \cdots, X_n are IID random variables with joint density f(x_1)\cdots f(x_n) and X^{(1)} \le X^{(2)} \le \cdots \le X^{(n)} are the order statistics, then the joint pdf of X^{(1)}, \cdots ,X^{(n)} is given by
g(x_1,\cdots, x_n) = \left\{
\begin{array}{cl}
n! f(x_1) \cdots f(x_n) &\textrm{ if } x_1 \le x_2 \le \cdots x_n \\
0 & \textrm{ else }
\end{array}
\right.
Simulation of Poisson process
The characterization of the Poisson process in terms of exponential random variables suggests immediately a very simple algorithm to simulate N_t.
Simulate independent exponential RVs T_1, T_2, \cdots with parameter \lambda and set N_t=0 for 0\le t <T_1, N_t=1 for T_1\le t <T_1+T_2, and so on.
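A direct transcription of this recipe in Python (numpy), returning the jump times that fall in a window [0, t_max]; the parameter values in the sanity check are arbitrary.

```python
import numpy as np

def poisson_process_path(lam, t_max, rng=None):
    """Simulate a Poisson process on [0, t_max] from exponential interarrival times."""
    rng = np.random.default_rng() if rng is None else rng
    jump_times = []
    t = rng.exponential(1 / lam)          # T_1
    while t <= t_max:
        jump_times.append(t)
        t += rng.exponential(1 / lam)     # next interarrival time
    return np.array(jump_times)           # N_t = number of entries <= t

# sanity check: E[N_t] = lam * t
counts = [len(poisson_process_path(2.0, 10.0)) for _ in range(5000)]
print(np.mean(counts), "should be close to", 2.0 * 10.0)
```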
Simulation of a Poisson random variable
It is not easy to simulate a Poisson random variable X directly from its pdf/cdf but we can do it elegantly using its relation with exponential random variables. To do this, generate independent exponential random variables until they sum up to 1 (so as to generate X=N_1) and use the relation between exponential and uniform random variables:
\begin{aligned}
X =n & \iff \sum_{k=1}^n T_k \le 1 < \sum_{k=1}^{n+1} T_k
\iff \sum_{k=1}^n - \frac{1}{\lambda} \ln(U_k) \le 1 < \sum_{k=1}^{n+1} - \frac{1}{\lambda} \ln(U_k) \\
& \iff \ln( \prod_{k=1}^n U_k) \ge - \lambda > \ln( \prod_{k=1}^{n+1} U_k)
\iff \prod_{k=1}^n U_k \ge e^{-\lambda} > \prod_{k=1}^{n+1} U_k
\end{aligned}
Algorithm to simulate a Poisson random variable with parameter \lambda: Generate random numbers until their product is smaller than e^{-\lambda}.
- Generate random numbers U_1, U_2, \cdots.
- Set X=n if n+1 = \inf \left\{ j \,:\, \prod_{k=1}^{j} U_k < e^{-\lambda} \right\}
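A direct transcription of this algorithm (numpy; the value of \lambda in the check is arbitrary):

```python
import numpy as np

def poisson_sample(lam, rng=None):
    """Generate X ~ Poisson(lam): multiply uniforms until the product drops below e^{-lam}."""
    rng = np.random.default_rng() if rng is None else rng
    threshold = np.exp(-lam)
    n, prod = 0, 1.0
    while True:
        prod *= rng.random()              # multiply in the next random number U_{n+1}
        if prod < threshold:
            return n                      # n+1 uniforms were needed, so X = n
        n += 1

rng = np.random.default_rng(1)
samples = [poisson_sample(3.0, rng) for _ in range(10000)]
print(np.mean(samples), np.var(samples))  # both should be close to lam = 3
```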
Long-time behavior of the Poisson process
We investigate the behavior of N_t for large time. We prove a CLT type result, namely that
\lim_{t \to \infty} \frac{N_t-\lambda t}{\sqrt{\lambda t}} = Z \textrm{ in distribution}
where Z is a standard normal RV.
Recall that the characteristic function of a Poisson RV Y with parameter \mu is E[e^{i\alpha Y}]= e^ {\mu (e^{i\alpha} -1)}. Therefore
E\left[ e^{i \alpha \frac{N_t - \lambda t}{\sqrt{\lambda t}}} \right] = e^{\lambda t \left( e^{i \frac{\alpha}{\sqrt{\lambda t}}} -i \frac{\alpha}{\sqrt{\lambda t}} -1\right)}
Expanding the exponential we have \lambda t \left(e^{i \frac{\alpha}{\sqrt{\lambda t}}} -i \frac{\alpha}{\sqrt{\lambda t}} -1\right) = - \frac{\alpha^2}{2} + O( \frac{1}{\sqrt{\lambda t}}) and thus \lim_{t \to \infty} E\left[ e^{i \alpha \frac{N_t - \lambda t}{\sqrt{\lambda t}}} \right]=e^{-\frac{\alpha^2}{2}}.
The same computation shows also that, for any fixed t, \lim_{\lambda \to \infty} \frac{N_t-\lambda t}{\sqrt{\lambda t}} = Z \textrm{ in distribution} since rescaling the parameter is equivalent to rescaling time.
Sampling a Poisson process
- We can sample or split a Poisson process. Suppose that every event of a Poisson process (independently of the other events) comes in two different types, say type 1 with probability p and type 2 with probability q=1-p.
Theorem 14.4 Suppose N_t is a Poisson process with parameter \lambda and that every event (independently) is either of type 1 with probability p or of type 2 with probability q=1-p. Then N^{(1)}_t, the number of events of type 1 up to time t, and N^{(2)}_t, the number of events of type 2 up to time t, are independent Poisson processes with rates \lambda p and \lambda(1-p).
Proof. We check that N^{(1)}_t satisfies the definition of a Poisson process and then use Theorem 14.1.
N^{(1)}_0=0 and N^{(1)}_t has independent increments since N_t has independent increments and events are classified as type 1 or 2 independently of each other.
We have
\begin{aligned}
P \left\{ N^{(1)}_{t + \Delta t}= N^{(1)}_t +1 \right\} & =
P\{ N_{t + \Delta t}= N_t +1 \textrm{ and the event is of type } 1 \} \\
& + P\{ N_{t + \Delta t} \ge N_t +2 \textrm{ and exactly one event is of type } 1 \} \\
&= \lambda \Delta t \times p + o(\Delta t)
\end{aligned}
\begin{aligned}
P \left\{ N^{(1)}_{t + \Delta t}= N^{(1)}_t \right\} & =
P\{ N_{t + \Delta t}= N_t \}
+ P\{ N_{t + \Delta t} = N_t + 1 \textrm{ and the event is of type 2} \} \\
& + P\{ N_{t + \Delta t} \ge N_t + 2 \textrm{ and no event of type $1$} \} \\
&= (1- \lambda \Delta t) + \lambda \Delta t (1-p) + o(\Delta t) =1-\lambda \Delta t p + o(\Delta t)
\end{aligned}
P \left\{ N^{(1)}_{t + \Delta t}\ge N^{(1)}_t +2 \right\} =o(\Delta t)
Finally to show that N^{(1)}_t and N^{(2)}_t are independent we compute their joint PDF by conditioning on the value of N_t and find
\begin{aligned}
P\left\{N^{(1)}_t =n , N^{(2)}_t =m \right\} &= P\left\{N^{(1)}_t =n , N^{(2)}_t =m | N_t={n+m}\right\} P\left\{N_t={n+m}\right\} \\
&= {n+m \choose n} p^n (1-p)^m e^{-\lambda t} \frac{(\lambda t)^{n+m}}{(n+m)!} \\
&= e^{-\lambda p t} \frac{ (\lambda p t )^n}{n!} e^{-\lambda (1-p) t} \frac{(\lambda(1-p) t)^{m}}{m!}
\end{aligned}
\blacksquare
The coupon collecting problem
We revisit the coupon collector problem but we relax the assumption that all the toys are equally probable. We assume that any box contains toy i with probability p_i. How do we now compute the expected number of boxes needed to collect all M toys? The argument used earlier does not generalize easily.
We use the following trick of randomizing the time between boxes. Instead of collecting boxes at fixed time intervals, we collect them at times which are exponentially distributed with parameter 1. Then the number of boxes collected up to time t is a Poisson process N_t with rate \lambda=1 (on average it takes the same time to get a new box). We now have M types of events (getting a box with toy i) and we split the Poisson process accordingly. Then by Theorem 14.4 the number of toys of type i collected up to time t, N^{(i)}_t, is a Poisson process with rate \lambda p_i=p_i and the Poisson processes N^{(i)}_t are independent.
We now consider the times
T^{(i)} = \textrm{ time of the first event for the process } N^{(i)}_t
that is, the time at which the first toy of type i is collected. The times T^{(i)} are independent since the underlying Poisson processes are independent, and T^{(i)} is exponential with parameter p_i. Furthermore
S = \max_{i} T^{(i)} = \textrm{ time until one toy of each type has been collected}.
By independence we have
P\left\{ S \le t \right\} = P\left\{ \max_{i} T^{(i)} \le t\right\} = \prod_{i=1}^M P\left\{ T^{(i)} \le t \right\} = \prod_{i=1}^M (1- e^{-p_it})
Thus
E[S]= \int_0^\infty P\{S \ge t\} dt = \int_0^\infty ( 1- \prod_{i=1}^M (1- e^{-p_it}) )\, dt
- Finally we relate S to the original question. If X is the number of boxes needed to collect all the toys then we have
S = \sum_{k=1}^X S_k
where the S_k are IID exponential with parameter 1 (the times between successive boxes). By conditioning on X it is easy to see that
E[S]= E[X]E[S_1] = E[X]
and we are done.
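Since E[X]=E[S], the expected number of boxes can be obtained by numerically integrating the formula above. A sketch (numpy; the grid, the cutoff t_max, and the probabilities used in the check are illustrative choices):

```python
import numpy as np

def expected_boxes(p, t_max=2000.0, n_grid=200000):
    """E[X] = E[S] = int_0^infty (1 - prod_i (1 - e^{-p_i t})) dt, evaluated on a finite grid."""
    p = np.asarray(p, float)
    t, dt = np.linspace(0.0, t_max, n_grid, retstep=True)
    integrand = 1.0 - np.prod(1.0 - np.exp(-np.outer(t, p)), axis=1)
    return integrand.sum() * dt

# equal probabilities: should match the classical answer M * (1 + 1/2 + ... + 1/M)
M = 10
print(expected_boxes(np.full(M, 1.0 / M)), M * np.sum(1.0 / np.arange(1, M + 1)))
```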
Poisson process with variable rate
We can generalize the Poisson process by making the rate \lambda(t) at which events occur depend on time: a nonhomogeneous Poisson process N_t with rate parameter \lambda(t) is a continuous time stochastic process such that
Independent increments: Given times s_1\le t_1 \le s_2 \le t_2 \cdots \le s_n \le t_n the random variables N_{t_i}-N_{s_i} (that is the number of events occurring in the disjoint time intervals [s_i,t_i]) are independent.
We have
\begin{aligned}
P \{ N_{t + \Delta t } = N_t \} &= 1- \lambda(t) \Delta t +o(\Delta t) & \\
P\{N_{t + \Delta t}= N_t +1\} &= \lambda(t) \Delta t +o(\Delta t) &
\textrm{ with } \lim_{\Delta t \to 0}\frac{o(\Delta t)}{\Delta t}=0 \\
P\{N_{t + \Delta t}\ge N_t +2\} & = o(\Delta t) &
\end{aligned}
\tag{14.2}
One way to construct a nonhomogeneous Poisson process is by sampling a homogeneous one in a time-dependent manner. Suppose \lambda(t) is bounded (locally in t) and pick \lambda \ge \lambda(t). We consider a Poisson process M_t with constant rate \lambda, and if an event occurs at time t we keep this event with probability p(t)=\frac{\lambda(t)}{\lambda} and discard it with probability 1-p(t). By the same argument as in the section Sampling a Poisson process, the number of kept events satisfies the definition of a nonhomogeneous Poisson process in Equation 14.2.
Let us consider an event of the process M_t which occurred in the interval [0,t]. By our analysis of arrival times we know that this event occurred at a time which is uniformly distributed on the interval [0,t]. The probability that this event was accepted and contributes to N_t is therefore
p_t = \frac{1}{t}\int_0^t \frac{\lambda(s)}{\lambda} \, ds
By repeating the second part of the argument in the section Sampling a Poisson process we see that N_t has a Poisson distribution with parameter
\lambda t p_t = \int_0^t \lambda(s) \, ds
and in particular
E[N_t] = \int_0^t \lambda(s) \, ds
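This sampling construction (often called thinning) is also the standard way to simulate a nonhomogeneous Poisson process. A sketch, assuming a rate function bounded by lam_max on the simulation window; the rate 1+\sin t used in the check is an arbitrary example.

```python
import numpy as np

def thinned_poisson(rate_fn, lam_max, t_max, rng=None):
    """Simulate a nonhomogeneous Poisson process by thinning a rate lam_max Poisson process."""
    rng = np.random.default_rng() if rng is None else rng
    events = []
    t = rng.exponential(1 / lam_max)
    while t <= t_max:
        if rng.random() < rate_fn(t) / lam_max:   # keep the event with probability lambda(t)/lam_max
            events.append(t)
        t += rng.exponential(1 / lam_max)
    return np.array(events)

# example: lambda(t) = 1 + sin(t), so E[N_t] = t + 1 - cos(t)
rng = np.random.default_rng(2)
t_max = 10.0
counts = [len(thinned_poisson(lambda s: 1 + np.sin(s), 2.0, t_max, rng)) for _ in range(5000)]
print(np.mean(counts), "should be close to", t_max + 1 - np.cos(t_max))
```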
Queueing model with infinitely many servers
Assume that the flow of customers entering an online store follows a Poisson process N_t with rate \lambda. The time S spent in the store by a single customer (browsing around, checking out, etc.) is given by its CDF G(t)=P\{ S \le t\} and we assume that the customers are independent of each other.
To figure out how to allocate resources one wants to determine the number of customers, M_t, which are still in the system at time t.
To find the distribution of M_t let us consider one of the customers who arrived by time t. If he arrived at time s\le t then he has left the system by time t with probability G(t-s) and is still in the system at time t with probability 1-G(t-s). Since the arrival time of that customer is uniform on [0,t], by the sampling argument above the distribution of M_t is Poisson with mean
E[M_t] = \lambda \int_0^t (1- G(t-s))\,ds = \lambda \int_0^t (1 - G(s))\,ds\,.
For large t, we see that E[M_t] \approx \lambda E[S].
Compound Poisson process
Suppose that the number of claims received by an insurance company follows a Poisson process. The size of each claim will differ and it is natural to assume that the claims are independent of each other. If we look at the total claims incurred by the insurance company this leads to a stochastic process called a compound Poisson process.
A stochastic process X_t is called a compound Poisson process if it has the form
X_t = \sum_{k=1}^{N_t} Y_k
where N_t is a Poisson process and Y_1, Y_2,\cdots are IID random variables which are also independent of N_t.
The process X_t has stationary independent increments. Using that N_t-N_s has the same distribution as N_{t-s}, we see that
X_t-X_s = \sum_{k=N_s+1}^{N_t} Y_k \textrm{ has the same distribution as } X_{t-s}= \sum_{k=1}^{N_{t-s}} Y_k
We can compute the MGF of X_t (or its characteristic function) by conditioning on N_t. Suppose m_Y(\alpha)=E[e^{\alpha Y}] is the moment generating function of Y; using the MGF of a Poisson RV we find
\begin{aligned}
m_{X_t}(\alpha)& =E\left[ e^{\alpha X_t}\right] = E\left[ e^{\alpha \sum_{k=1}^{N_t} Y_k} \right] = \sum_{n=0}^\infty E\left[ e^{\alpha \sum_{k=1}^{N_t} Y_k} | N_t =n \right] P\{N_t=n\} \\
& = \sum_{n=0}^\infty m_Y(\alpha)^n P\{N_t=n\} = e^{\lambda t ( m_Y(\alpha) -1 )}
\end{aligned}
We can compute then the mean and variance
\begin{aligned}
m_{X_t}'(\alpha)&= e^{\lambda t ( m_Y(\alpha) -1 )} \lambda t \, m_Y'(\alpha) \\
m_{X_t}''(\alpha)&= e^{\lambda t ( m_Y(\alpha) -1 )} \left(\lambda t \, m_Y''(\alpha) + (\lambda t)^2 m_Y'(\alpha)^2 \right)
\end{aligned}
and thus
E[X_t]= \lambda t E[Y]
\quad \textrm{ and } \quad {\rm Var}(X_t)= \lambda t \left({\rm Var}(Y) + E[Y]^2\right) = \lambda t E[Y^2]
With a bit more work we could prove a central limit theorem.
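The mean and variance formulas are easy to check by simulation. A sketch (numpy; the exponential claim sizes and parameter values are arbitrary illustrative choices):

```python
import numpy as np

def compound_poisson(lam, t, claim_sampler, rng):
    """One sample of X_t = Y_1 + ... + Y_{N_t} with N_t ~ Poisson(lam * t)."""
    n = rng.poisson(lam * t)
    return claim_sampler(n, rng).sum()

rng = np.random.default_rng(3)
lam, t, mean_Y = 2.0, 5.0, 1.5            # claims Y ~ Exponential with mean 1.5
samples = np.array([compound_poisson(lam, t, lambda n, r: r.exponential(mean_Y, n), rng)
                    for _ in range(20000)])
# theory: E[X_t] = lam*t*E[Y]  and  Var(X_t) = lam*t*E[Y^2] = lam*t*2*mean_Y**2 for exponential claims
print(samples.mean(), lam * t * mean_Y)
print(samples.var(), lam * t * 2 * mean_Y**2)
```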
Continuous time Markov chains
In this section we build a continuous time Markov process X_t with t\ge 0. The Markov property can be expressed as
P\{ X_t=j| \{X_{r}\}, 0\le r \le s \} = P\{X_t=j| X_s \} \,.
for any 0 < s < t.
Exponential random variables
To construct a Markov chain we will need lots of exponential random variables. Recall that an exponential random variable T with parameter \lambda has the pdf f_T(t)=\lambda e^{-\lambda t} for t \ge 0, the cdf F_T(t) = 1- e^{-\lambda t}, and mean E[T]=\frac{1}{\lambda}.
Proposition 15.1 (Properties of exponential random variables) Let T_1, T_2, T_3, \cdots be independent exponential random variables with parameters \lambda_1, \lambda_2, \cdots. Then
- T= \min\{ T_1, \cdots, T_n\} is an exponential random variable with parameter \lambda_1+ \cdots + \lambda_n. Note that n=\infty is allowed if we assume that \sum_{n}\lambda_n is finite.
- \displaystyle P\{ T_i = \min \{ T_1, \cdots, T_n\} \} = \frac{\lambda_i}{\lambda_1+ \cdots + \lambda_n}
Proof. For 1. we have, using independence,
P\left\{ T > t \right\} = P\left\{ T_1 > t, \cdots, T_n > t \right\} = P\left\{ T_1 > t \right\} \cdots P\left\{ T_n > t \right\}
= e^{-(\lambda_1+ \cdots + \lambda_n)t}
and thus T is an exponential random variable.
For 2. we have, by conditioning,
P\left\{ T_1 = T \right\} = \int_0^\infty P\{ T_2>t, \cdots, T_n > t \} f_{T_1}(t) \, dt = \int_0^\infty e^{-(\lambda_2 + \cdots+ \lambda_n)t} \lambda_1 e^{-\lambda_1 t}\, dt = \frac{\lambda_1}{\lambda_1+ \cdots + \lambda_n}
\blacksquare
Definition of a continuous time Markov chain
- As for the Poisson process we will give two equivalent definitions of the process. The first one describes the infinitesimal rates of change and leads to a system of ODEs describing the evolution of the pdf of X_t. The second definition uses exponential random variables and waiting times and leads to an algorithm to simulate a continuous time Markov chain, often called the stochastic simulation algorithm.
- To define a Markov process on the state space S we assign a number \alpha(i,j) to any pair of states i\not=j. You should think of these numbers as
\alpha(i,j) = \textrm{ rate at which the chain changes from state } i \textrm{ to state } j \,.
We denote
\alpha(i) = \sum_{j\not= i} \alpha(i,j) = \textrm{ rate at which the chain changes from state } i \,.
- Formally, a continuous time Markov chain with rates \alpha(i,j) is a stochastic process such that
\begin{aligned}
P\{ X_{t+\Delta t}=i| X_t=i\} &= 1 - \alpha(i)\Delta t + o(\Delta t) \\
P\{ X_{t+\Delta t}=j| X_t=i\} &= \alpha(i,j)\Delta t + o(\Delta t)
\end{aligned}
- Proceeding as for the Poisson process we can derive a differential equation for p_t(i)=P\{X_t =i\}. By conditioning we have
P\{ X_{t+\Delta t}=i\} = (1 - \alpha(i)\Delta t )P\{ X_t=i\} + \sum_{j \not =i} \alpha(j,i)\Delta t P\{ X_t=j\} + o(\Delta t)
which leads to the system of linear ODE’s
\frac{d}{dt}p_t(i) = -\alpha(i) p_t(i) + \sum_{j \not= i} \alpha(j,i) p_t(j)
\tag{15.1} called the Kolmogorov forward equations.
- The infinitesimal generator of a continuous-time Markov chain is given by the matrix
A =
\begin{matrix}
1 \\ 2 \\ 3 \\ \vdots
\end{matrix}
\begin{pmatrix}
-\alpha(1) & \alpha(1,2) & \alpha(1,3) & \cdots \\
\alpha(2,1) & -\alpha(2) & \alpha(2,3) & \cdots\\
\alpha(3,1) & \alpha(3,2) & -\alpha(3) & \cdots\\
\vdots & \vdots & \vdots &
\end{pmatrix}
and the entries of A satisfy
A(i,j)\ge 0 \textrm{ for } i \not= j \quad \textrm{ and } \quad \sum_{j} A(i,j) = 0 \textrm{ for every } i
If we use a row vector p_t = (p_t(1), p_t(2), \cdots) then we can rewrite Equation 15.1 as the system
\frac{d}{dt}p_t = p_t A
\tag{15.2}
If S is finite then one can write the solution in terms of the matrix exponential e^{tA}= \sum_{k=0}^\infty \frac{t^k A^k}{k!}
p_t = p_0 e^{tA} \,.
where p_0(i)=P\{X_0=i\} is the initial distribution.
We can also write an equation for the transition probabilities (take X_0=i)
P_t(i,j) = P \{ X_t =j| X_0=i\}
and we obtain the matrix equation
\frac{d}{dt}P_t = P_t A, \quad \textrm{ with } P_0 =I \quad \implies \quad P_t = e^{tA}
which we can solve using linear algebra techniques.
Example
A =
\begin{matrix}
1 \\ 2 \\ 3 \\ 4
\end{matrix}
\begin{pmatrix}
-1 & 1 & 0 & 0 \\
1 & -3 & 1 & 1 \\
0 & 1 & -2 & 1 \\
0 & 1 & 1 & -2
\end{pmatrix}
The eigenvalues of A are 0, -1, -3, -4 and we can diagonalize A and find
D=\begin{pmatrix}
0 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 \\
0 & 0 & -3 & 0 \\
0 & 0 & 0 & -4 \\
\end{pmatrix}
= Q^{-1} A Q
where the columns of Q are the corresponding eigenvectors:
Q=\begin{pmatrix}
1& 1 & 0 & -\frac13 \\
1& 0 & 0 & 1 \\
1& -\frac12 & -\frac12 & -\frac13 \\
1& -\frac12 & \frac12 & \frac13 \\
\end{pmatrix} \quad \quad
Q^{-1}=\begin{pmatrix}
\frac14 & \frac14 & \frac14 & \frac14 \\
\frac23& 0 & -\frac13 & -\frac13 \\
0& 0 & -1 & -1 \\
-\frac14& \frac34 & -\frac14 & -\frac14 \\
\end{pmatrix}
Then we find
P_t = e^{tA} = Q e^{tD} Q^{-1}.
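In practice P_t can also be evaluated numerically, either through the diagonalization above or with a matrix exponential routine. A sketch using scipy for the generator of this example (the value of t is an arbitrary choice):

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[-1.,  1.,  0.,  0.],
              [ 1., -3.,  1.,  1.],
              [ 0.,  1., -2.,  1.],
              [ 0.,  1.,  1., -2.]])

P = expm(1.5 * A)                 # P_t(i, j) for t = 1.5
print(P.round(4))
print(P.sum(axis=1))              # each row sums to 1
print(expm(50 * A).round(4))      # for large t every row approaches (1/4, 1/4, 1/4, 1/4)
```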
Alternative description
We can also construct the Markov chain X_t using exponential random variables.
If X_t=i, consider independent exponential random variables T(i,j) with parameters \alpha(i,j) for j \not =i. Think of these as exponential clocks: when the first of the clocks rings, say the clock associated with j, the Markov chain moves to state j.
Using Proposition 15.1 we see that the chain moves from state i to j if the clock associated with the pair (i,j) rings first, and this occurs with probability
Q(i,j) = \frac{\alpha(i,j)}{\sum_{j \not= i}\alpha(i,j)} = \frac{\alpha(i,j)}{\alpha(i)}
The matrix Q(i,j) is the transition matrix of a discrete time Markov chain Y_n which has the same jump probabilities as X_t but whose transitions occur at unit time steps.
Once the chain has moved to a new state, a new set of clocks with rates \alpha(j,k) is produced and the process starts anew.
By Proposition 15.1 we can express this slightly differently: in state i we use a single clock with rate \alpha(i), and when this clock rings we move to state j according to the discrete time Markov chain Q(i,j). A sketch of the resulting simulation algorithm is given below.
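This single-clock description is precisely the stochastic simulation algorithm mentioned earlier (also known as Gillespie's algorithm). A minimal sketch for a finite state space, with the rates supplied as a matrix `alpha` whose diagonal is ignored (the function name and inputs are illustrative):

```python
import numpy as np

def simulate_ctmc(alpha, i0, t_max, rng=None):
    """Simulate X_t with jump rates alpha[i, j]; returns the jump times and the visited states."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = np.asarray(alpha, float)
    times, states = [0.0], [i0]
    t, i = 0.0, i0
    while True:
        rates = alpha[i].copy()
        rates[i] = 0.0
        total = rates.sum()                  # alpha(i): total rate of leaving state i
        if total == 0.0:                     # absorbing state
            break
        t += rng.exponential(1 / total)      # Exponential(alpha(i)) holding time
        if t > t_max:
            break
        i = rng.choice(len(rates), p=rates / total)   # jump according to Q(i, .)
        times.append(t)
        states.append(i)
    return np.array(times), np.array(states)
```

Applied to the off-diagonal part of the four-state generator of the earlier example, this produces trajectories of that chain.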
The Markov property follows from the memoryless property of the exponential distribution: P(T\ge t +s| T \ge s) = P(T\ge t).
If X_t=i then by construction the position after the next jump after time t depends only on i and not on the states that the Markov chain visited before time t. Moreover, if we consider the last jump before time t, which occurred, say, at time s<t, then the memoryless property of the exponential random variable implies that the time at which the next jump occurs after time t does not depend on s at all. Putting these together implies the Markov property
P\{X_{t+u} =j | X_{s}, 0 \le s \le t\} = P\{X_{t+u} =j | X_t \}
To connect this to the previous description we note that by conditioning on the first jump
P_t(i,j) = P(X_t=j|X_0=i) = \delta(i,j) e^{-\alpha(i) t} + \int_0^t \alpha(i) e^{-\alpha(i)s} \sum_{k} {Q(i,k)} P_{t-s}(k,j) ds
Iterating this equation and setting t=\Delta t we find
\begin{aligned}
P(X_{\Delta t}=j|X_0=i) &= \delta(i,j) e^{-\alpha(i) \Delta t} + \int_0^{\Delta t} \alpha(i) e^{-\alpha(i)s} \sum_{k} \frac{\alpha(i,k)}{\alpha(i)} \delta(k,j) e^{-\alpha(k)(\Delta t-s)} ds + \cdots \\
&= \delta(i,j) e^{-\alpha(i) \Delta t} + \alpha(i,j) \Delta t \times \underbrace{ e^{-\alpha(j) \Delta t}
\frac{1}{\Delta t} \int_0^{\Delta t} e^{-(\alpha(i)-\alpha(j)) s} ds
}_{\to 1 \textrm{ as }\Delta t \to 0} + \cdots \\
&= \delta(i,j) (1 - \Delta t \alpha(i)) + \alpha(i,j) \Delta t + o(\Delta t)
\end{aligned}
The higher order terms are negligible since the probability to have two jumps in the time interval [0,\Delta t] is o(\Delta t).
Stationary distributions and detailed balance
A probability vector \pi is a stationary distribution for the Markov chain with generator A if
\pi P_t = \pi \quad \textrm{ for all } t \ge 0
which, by the Kolmogorov equation, is equivalent to
\pi A = 0
In terms of the rates \alpha(i,j) we see that \pi is stationary if and only if
\sum_{i \not= j} \pi(i) \alpha(i,j) = \pi(j)\alpha(j) = \sum_{i \not=j}
\pi(j) \alpha(j,i)
which we can interpret as a balance equation: the rate \sum_{i\not=j}\pi(i)\alpha(i,j) at which the chain, in stationarity, enters the state j equals the rate \pi(j)\alpha(j) at which it leaves the state j.
As for discrete time we say that a Markov chain satisfies detailed balance if
\pi(i) \alpha(i,j) = \pi(j) \alpha(j,i) \textrm{ for all } i \not= j
and detailed balance obviously implies stationarity.
The Markov chain with generator A is irreducible if for any pair of states i,j we can find states i_1, \cdots, i_{N-1} such that
\alpha(i,i_1) \alpha(i_1, i_2) \cdots \alpha(i_{N-1},j) >0
If there exists a stationary distribution for an irreducible chain then \pi(i)>0 for all i. Indeed, if \pi(i)>0 and \alpha(i,j)>0 then (\pi A)(j) =0 implies that \pi(j)\alpha(j) = \sum_{k}\pi(k)\alpha(k,j) \ge \pi(i)\alpha(i,j)>0 and thus \pi(j)>0.
The issue of periodicity cannot occur for continuous time Markov chains:
Lemma 15.1 For an irreducible Markov chain with generator A, P_t(i,j)>0 for all i,j and all t>0.
Proof. Using the Markov property
\begin{aligned}
P\{X_t=j|X_0=i\} &\ge P \{ X_{t/N}=i_1, X_{2t/N}=i_2, \cdots, X_{t}=j |X_0=i\} \\
&= P \{ X_{t/N}=i_1| X_0=i\} P\{ X_{2t/N}=i_2| X_{t/N}=i_1\} \cdots P\{ X_{t}=j |X_{t\frac{N-1}{N}}=i_{N-1}\}
\end{aligned}
and for example
P \{ X_{t/N}=i_1| X_0=i\} \ge \int_0^{t/N} \alpha(i) e^{-\alpha(i)s} Q(i,i_1)
e^{-\alpha(i_1)(t/N-s)}\, ds>0
\blacksquare
Convergence to the stationary distribution
If the state space is finite and the chain is irreducible then we have convergence to equilibrium.
Theorem 15.1 If the state space S is finite and the Markov chain with generator A is irreducible then for any initial distribution \mu we have
\lim_{t \to \infty}\mu P_t = \pi
Proof. We use a spectral argument and the result for discrete time Markov chains. Pick a number a strictly greater than all the rates \alpha(i) (this is possible since S is finite). Consider the matrix
Q= \frac{1}{a}A + I.
Then Q is a stochastic matrix and the Markov chain Y_n with transition matrix Q is irreducible and aperiodic (because Q(i,i)>0). Then from Theorem 6.3 we know (details in the exercises) that Q has a simple eigenvalue 1 and all other eigenvalues \lambda satisfy |\lambda|<1. Now
Qf = \lambda f \iff Af =a(\lambda -1) f
and so 0 is a simple eigenvalue for A and the other eigenvalues are of the form a(\textrm{Re}(\lambda) -1) + i a \,\textrm{Im}(\lambda), and so their real parts are strictly negative.
The vector 1=(1,1, \cdots,1 )^T is a right eigenvector of A and \pi is a left eigenvector of A, both for the eigenvalue 0. If we define the matrix \Pi as the matrix whose rows are all equal to \pi, we then have
(P_t - \Pi) \begin{pmatrix} 1\\ \vdots \\1 \end{pmatrix}= 0 \textrm{ and } \pi(P_t - \Pi) =0\,.
Moreover, if f is a right eigenvector and g a left eigenvector of A for an eigenvalue \mu \not=0 then we have
\pi (Af) = \mu \pi f = (\pi A) f =0 \quad \textrm{ and } (g A) 1 = g (A1) = \mu g 1 =0
and thus we must have \pi f=0 and g 1 =0. Therefore
(P_t - \Pi)f = P_t f = e^{\mu t} f \textrm{ and } g(P_t - \Pi) = gP_t = e^{\mu t} g.
This implies that P_t - \Pi has the simple eigenvalue 0, and its remaining eigenvalues are e^{\mu t} where \mu runs over the nonzero eigenvalues of A, all of which have strictly negative real part. Therefore P_t - \Pi converges to 0 as t \to \infty, or
\lim_{t \to \infty} P_t(i,j)= \pi(j) .
Transient behavior
To study the transient behavior in continuous time we can use similar ideas as in discrete time.
Absorption probabilities: the absorption probabilities do not depend on the time spent in each state, so they can be computed using the transition matrix Q(i,j) of the embedded chain Y_n and the formula in the section Absorption probabilities.
Expected hitting time: For example we have the following result.
Theorem 15.2 Suppose X_t is a Markov chain with generator A and let
\sigma(j) = \inf\{t \ge 0 \,:\, X_t=j\}
be the first hitting time of state j. Let \tilde{A} be the matrix obtained by deleting the j^{th} row and the j^{th} column from the generator A. Then we have
E[\sigma(j)|X_0=i] = \sum_{l \not= j} B(i,l) \quad \textrm{ where } B= (-\tilde{A})^{-1}
The matrix \tilde{A} has row sums which are non-positive and at least one row sum is strictly negative.
Proof. By conditioning on the first jump which happens at time T we have
E[\sigma(j)|X_0=i] = \underbrace{E[T|X_0=i]}_{\textrm{expected time until the first jump }} + \sum_{k\in S, k\not=j} P\{ X_T = k|X_0=i\} \underbrace{E[ \sigma(j)|X_0=k]}_{\textrm{expected hitting time} \atop \textrm{ from the state after the first jump}}
If we set b(i)=E[\sigma(j)|X_0=i] (for i \not= j) we find the equation
b(i)= \frac{1}{\alpha(i)} + \sum_{k \not= j} \frac{\alpha(i,k)}{\alpha(i)} b(k) \implies 1 = \alpha(i) b(i) - \sum_{k\not= j}\alpha(i,k) b(k)
which reads, in matrix form as
1 = -\tilde{A} b \implies b= (-\tilde{A})^{-1} 1.
To show that -\tilde{A} is invertible we consider the matrix Q= \frac{1}{a}\tilde{A} + I where a is chosen larger than all the rates \alpha(i). Then the entries of Q are non-negative and its row sums do not exceed one, with at least one strictly less than 1. By the results for discrete time Markov chains we know that I-Q= -\frac{1}{a} \tilde{A} is invertible. \quad \blacksquare
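For a finite chain, the formula b=(-\tilde{A})^{-1}\mathbf{1} is a single linear solve. A sketch, reusing the four-state generator from the earlier example (the target state j is an arbitrary choice):

```python
import numpy as np

A = np.array([[-1.,  1.,  0.,  0.],
              [ 1., -3.,  1.,  1.],
              [ 0.,  1., -2.,  1.],
              [ 0.,  1.,  1., -2.]])

j = 3                                          # target state (0-indexed), i.e. the state labelled 4
keep = [k for k in range(len(A)) if k != j]
A_tilde = A[np.ix_(keep, keep)]                # delete the j-th row and column
b = np.linalg.solve(-A_tilde, np.ones(len(keep)))
for i, bi in zip(keep, b):
    print(f"E[sigma({j + 1}) | X_0 = {i + 1}] = {bi:.4f}")
```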
Explosion
For a continuous time Markov chain X_t let us consider the time of the successive jumps
S_1=T_1, S_2=T_1+T_2, S_3=T_1+T_2+T_3, \cdots
Here the T_i are independent exponential random variables but, in general, not identically distributed: their parameters \alpha_i depend on the state being visited. We have then
E[S_n]= \sum_{i=1}^n \frac{1}{\alpha_i}
Explosion: To see what can happen consider a Markov chain with rates \alpha(n,n+1)=(n+1)^2 and all other rates equal to 0. Then the Markov chain moves up by 1 at every jump, like a Poisson process but at an accelerated pace. We have then
E[S_\infty]= \sum_{n=1}^\infty \frac{1}{n^2} < \infty
so S_\infty < \infty with probability 1. There are thus infinitely many jumps in finite time and X_t=+\infty after a finite time. This is called explosion.
This issue is familiar from ODEs: the equation \frac{d}{dt} x_t = x_t^2 has solution x_t= \frac{x_0}{1-x_0 t} which blows up at time t=\frac{1}{x_0}.
It is not easy to determine if an explosion really occurs. Indeed, for no explosion to occur we must have, with probability 1,
\sum_{n} \frac{1}{\alpha(Y_n)} = \infty
where Y_n is the embedded chain.
A sufficient condition for non-explosion is a suitable upper bound on the rates \alpha(i), say \alpha(i) \le \alpha, which is true for finite state spaces, but this is by no means necessary.
Transience, recurrence, and positive recurrence.
Recurrence and transience for a continuous time Markov chain can be defined as for discrete time. An irreducible Markov chain X_t is recurrent if the chain starting from a state i returns to i with probability 1, and transient if X_t never returns to i with some non-zero probability.
Transience/recurrence has nothing to do with the actual time spent in every state and so we have immediately
X_t \textrm { with rates } \alpha(i,j) \textrm{ is } \begin{array}{l} \textrm{transient} \\ \textrm{recurrent}\end{array} \iff
Y_n \textrm{ with transition } Q(i,j)=\frac{\alpha(i,j)}{\alpha(i)}
\textrm{ is } \begin{array}{l} \textrm{transient} \\ \textrm{recurrent}\end{array}
For positive recurrence we need to define the first return time to a state i. This is most easily done in terms of the return time \tau(i) of the embedded Markov chain Y_n. The time until X_t returns to i will be the sum of the times spent in each state: if S_n denotes the time of the n^{th} jump and \tau(i)=n after visiting the states j_0, j_1, \cdots, j_{n-1}, j_n=i, then the return time is
\sum_{k=0}^{n-1}(S_{k+1}- S_k)
where S_{k+1}-S_k is exponential with parameter \alpha(j_k). The first return time to state i for the Markov chain X_t is given by
\Sigma(i) = \sum_{k=0}^{\tau(i)-1}(S_{k+1}- S_k)
Stationary distribution for recurrent chains
Theorem 15.3 If the Markov chain with rates \alpha(i,j) is irreducible and recurrent, then there exists a unique solution (up to a multiplicative constant) \eta=(\eta(1), \eta(2), \cdots) to the equation \eta A=0 with
0 < \eta(i) < \infty \,.
If it holds that \sum_{i}\eta(i) < \infty then \eta can be normalized to a stationary distribution and X_t is positive recurrent.
Proof. The k^{th} component of the equation \eta A=0 can be written as
\sum_{j \not= k} \eta(j) \alpha(j,k) = \alpha(k) \eta(k) \iff \sum_{j \not= k} \eta(j)\alpha(j) Q(j,k) = \eta(k)\alpha(k)
That is the row vector \mu with entries \mu(k)=\alpha(k) \eta(k) must satisfy \mu Q = \mu.
If X_t is recurrent then the embedded Markov chain Y_n with transition matrix Q is recurrent and so by Theorem 9.1 there exists a solution to \mu Q=\mu; therefore we have found a solution of \eta A=0.
Moreover, from Theorem 9.1 we have the representation
\mu(j)=\alpha(j) \eta(j) = E\left[ \sum_{k=0}^{\tau(i)-1} {\bf 1}_{\{Y_k=j\}}\right]
where i is some fixed but arbitrary reference state (this counts the number of visits to the state j between two consecutive visits to the reference state i)
If we denote by S_n the time of the n^{th} jump for X_t we have
\begin{aligned}
\eta(j) &= \sum_{k=0}^\infty E\left[ \frac{1}{\alpha(j)} {\bf 1}_{\{Y_k=j\}} {\bf 1}_{\{\tau(i) >k\}} \right] = \sum_{k=0}^\infty E\left[ (S_{k+1} -S_k) {\bf 1}_{\{Y_k=j\}} {\bf 1}_{\{\tau(i) >k\}} \right]
= E\left[ \sum_{k=0}^{\tau(i)-1} (S_{k+1} -S_k) {\bf 1}_{\{Y_k=j\}} \right]
\end{aligned}
which is nothing but the time spent (by X_t) in the state j between successive visits to i.
If \sum_{j} \eta(j) <\infty then we have
E[\Sigma(i)]=E\left[ \sum_{k=0}^{\tau(i)-1} (S_{k+1} -S_k)\right] = \sum_{j}\eta(j) <\infty
which is the expected return time to state i. That is, the chain X_t is positive recurrent. \quad \blacksquare
Ergodic theorem for positive recurrent Markov chains
We have the following theorem, which is the exact counterpart of the discrete time case (and is proved very similarly, so we omit the proof).
Theorem 15.4 Suppose X_t is irreducible and positive recurrent. Then X_t has a unique stationary distribution \pi and, with probability 1, the fraction of time spent in state j converges to \pi(j):
\lim_{t\to \infty}\frac{1}{t}\int_0^t \mathbf{1}_{\{X_s=j\}} \, ds = \pi(j).
Moreover we have Kac’s formula, which relates \pi(j) to the expected return time to state j:
\pi(j) = \frac{1}{\alpha(j)\, E[\Sigma(j)|X_0=j]}.
Conversely if X_t has a stationary distribution then X_t is positive recurrent.
Birth and death process
A general birth and death process is a continuous time Markov chain with state space \{0,1,2,\cdots\} and whose only non-zero transition rates are
\begin{aligned}
&\lambda(n) = \alpha(n,n+1) & = \textrm{ birth rate for a population of size } n & (\textrm{ for } n\ge 0) \\
&\mu(n)=\alpha(n,n-1) & = \textrm{ death rate for a population of size } n & (\textrm{ for } n\ge 1)
\end{aligned}
The Kolmogorov equations for the distribution of X_t are, for n\ge 1
\frac{d}{dt}p_t(n) = \underbrace{ \mu(n+1) p_t(n+1)}_{\textrm{increase due to a death} \atop \textrm{in a population of size } n+1} +
\underbrace{\lambda(n-1) p_t(n-1)}_{\textrm{increase due to a birth} \atop \textrm{in a population of size } n-1} - \underbrace{(\lambda(n) + \mu(n))p_t(n)}_{\textrm{decrease due to a birth/death} \atop \textrm{in a population of size } n}.
For n=0, the equation reads \displaystyle \frac{d}{dt}p_t(0)= \mu(1) p_t(1) - \lambda(0) p_t(0).
The generator has the form
\begin{matrix}
0 \\ 1 \\ 2\\ \vdots
\end{matrix}
\begin{pmatrix}
-\lambda(0) & \lambda(0) & 0 & 0 & \dots \\
\mu(1) &-\lambda(1) -\mu(1) & \lambda(1) & 0 & \dots \\
0 & \mu(2) &-\lambda(2) -\mu(2) & \lambda(2) & \dots \\
\vdots & \ddots & \ddots & \ddots & \\
\end{pmatrix}
- The transition matrix for the embedded process Y_n is
\begin{matrix}
0 \\ 1 \\ 2\\ \vdots
\end{matrix}
\begin{pmatrix}
0 & 1 & 0 & 0 & \dots \\
\frac{\mu(1)}{\mu(1)+ \lambda(1)} & 0 & \frac{\lambda(1)}{\mu(1)+ \lambda(1)} & 0 & \dots \\
0 & \frac{\mu(2)}{\mu(2)+ \lambda(2)} & 0 & \frac{\lambda(2)}{\mu(2)+ \lambda(2)} & \dots \\
\vdots & \ddots & \ddots & \ddots & \\
\end{pmatrix}
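As a concrete illustration, the sketch below simulates such a chain with the single-clock description and compares the empirical occupation fractions with the distribution obtained from detailed balance, \pi(n)\lambda(n)=\pi(n+1)\mu(n+1); the rates \lambda(n)=\lambda, \mu(n)=n\mu and the cap n_max on the population are arbitrary illustrative choices.

```python
import numpy as np

def birth_death_occupation(lam_fn, mu_fn, n_max, t_max, rng=None):
    """Fraction of time spent in each state of a birth-death chain on {0, ..., n_max}."""
    rng = np.random.default_rng() if rng is None else rng
    occupation = np.zeros(n_max + 1)
    t, n = 0.0, 0
    while t < t_max:
        up = lam_fn(n) if n < n_max else 0.0     # birth rate lambda(n)
        down = mu_fn(n) if n > 0 else 0.0        # death rate mu(n)
        hold = rng.exponential(1 / (up + down))  # Exponential(alpha(n)) holding time
        occupation[n] += min(hold, t_max - t)
        t += hold
        n = n + 1 if rng.random() < up / (up + down) else n - 1
    return occupation / occupation.sum()

lam, mu, n_max = 3.0, 1.0, 30
emp = birth_death_occupation(lambda n: lam, lambda n: n * mu, n_max, 50000.0,
                             np.random.default_rng(4))
pi = np.ones(n_max + 1)                          # detailed balance: pi(n+1) = pi(n) * lambda(n) / mu(n+1)
for n in range(n_max):
    pi[n + 1] = pi[n] * lam / ((n + 1) * mu)
pi /= pi.sum()
print(0.5 * np.abs(emp - pi).sum())              # total variation distance, should be small
```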