Stochastic Processes

Math 606, Spring 2023

Luc Rey-Bellet

University of Massachusetts Amherst

2023-05-07

1 Simulation

Simulation of random variables

  • Problem: How do we generate samples from a random variable X on a computer?
  • A computer has a built-in (pseudo-)random number generator.
# Generating random numbers
import numpy as np
np.random.rand(20)
array([0.88456123, 0.7246901 , 0.96134441, 0.18411352, 0.53739148,
       0.69442635, 0.53443413, 0.33034303, 0.29912758, 0.9107155 ,
       0.8946926 , 0.60576729, 0.77851862, 0.0902323 , 0.52653363,
       0.10969704, 0.2229144 , 0.11201706, 0.95301018, 0.26606789])
  • We idealize it as generating independent uniform random variables U_1, U_2, \cdots on the interval [0,1]. In practice we have only finite precision and the random number generator is pseudo-random.
  • How do we use random numbers to generate other random variables? This problem usually goes under the name of simulation of random variables.
  • There are a few general principles and lots of clever tricks to efficiently generate samples from specific distributions.
  • We give some examples and general principles in this section and will add more along the way.

Generating discrete distributions

Theorem 1.1 (Generating discrete distributions) Suppose X is a discrete random variable taking values x_1, x_2, x_3, \cdots with p.d.f
p(j) = P(X=x_j)\,. To simulate X

  • Generate a random number U.

  • Set X\,=\, \left\{ \begin{array} {ll} x_1 & {\rm ~if~} U < p(1) \\ x_2 & {\rm ~if~} p(1) \le U < p(1) + p(2) \\ \vdots & \vdots \\ x_n & {\rm ~if~} p(1) + \cdots + p(n-1) \le U < p(1) + \cdots + p(n) \\ \vdots & \vdots \\ \end{array} \right.

Then X has the desired distribution.

Proof. This follows simply from P(a \le U \le b) = b-a if 0\le a \le b \le 1.

Examples

  • Roll a die
# Generating the roll of n dice
import numpy as np
n=10
np.ceil(6*np.random.rand(n))
array([5., 5., 1., 1., 4., 4., 3., 2., 5., 2.])
  • Geometric RV: If X is a geometric random variable with parameter p, then p(n) =p (1-p)^{n-1} \quad \quad P(X \le n) = 1 - (1-p)^n \quad n=1,2,\cdots Then (a code sketch follows this list)
    1- (1-p)^{n-1} \le U < 1- (1-p)^{n} \iff (1-p)^n < (1-U) \le (1-p)^{n-1} or equivalently, since (1-U) has the same distribution as U, \textrm{Set } X=n \quad \textrm{ if } \quad(1-p)^n < U \le (1-p)^{n-1}
  • In general it might not be easy to compute the CDF P(X \le n), so for specific distributions (say Poisson distributions) we will use special tricks.
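For the geometric distribution the cumulative probabilities are available in closed form, so the recipe above can be implemented directly; the rule "set X=n if (1-p)^n < U \le (1-p)^{n-1}" is the same as X = \lceil \ln(U)/\ln(1-p) \rceil (up to ties of probability zero). A minimal sketch; the parameter p and the sample size below are only illustrative.

# Generating geometric(p) samples: X = n when (1-p)^n < U <= (1-p)^(n-1)
import numpy as np
p = 0.3
U = np.random.rand(10)
X = np.ceil(np.log(U) / np.log(1 - p))
X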

Inversion method

Theorem 1.2 (Inversion method)  

  • Suppose X is a continuous random variable with CDF F_X(t) and F_X(t) is invertible. Then X and F_X^{-1}(U) have the same distribution.

  • To simulate X generate a random number U and set X=F_X^{-1}(U).

Proof. Since P(U\le a)=a for 0\le a \le 1, P(F_X^{-1}(U)\le t) = P(U \le F_X(t)) = F_X(t) and thus F_X^{-1}(U) has distribution F_X(t).

  • Example: If T has an exponential distribution with parameter \lambda then the CDF is F(t)=P(T \le t) = 1 -e^{-\lambda t} and F(t) = 1-e^{-\lambda t}=u \iff t = -\frac{1}{\lambda} \ln(1-u) = F^{-1}(u) So if U is uniform on [0,1] then T=-\frac{1}{\lambda} \ln(1-U) has an exponential distribution with parameter \lambda. Since U and 1-U have the same distribution, T=-\frac{1}{\lambda} \ln(U) also has an exponential distribution with parameter \lambda.
  • Downside: in general F and F^{-1} are not available in closed form (one must compute them numerically).
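For the exponential example above the inverse CDF is explicit, so the inversion method is one line of code. A minimal sketch; the parameter \lambda and the sample size are only illustrative.

# Inversion method for an exponential RV: T = -ln(U)/lambda
import numpy as np
lam = 2.0
U = np.random.rand(5)
T = -np.log(U) / lam   # uses that U and 1-U have the same distribution
T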

Rejection method

The next method makes it possible to simulate X provided we can simulate Y (X and Y should have the same support).

Theorem 1.3 (Rejection Method) Suppose that X has pdf f(x) and Y has pdf g(x), and that there exists a constant C such that \frac{f(y)}{g(y)} \le C \quad \quad \textrm{ for all } y To generate X do

  • Step 1: Generate the random variable Y.

  • Step 2: Generate a random number U.

  • Step 3: If U \le \frac{f(Y)}{g(Y)C} set X=Y otherwise reject and go back to Step 1.

The number of times the algorithm runs until a value for X is accepted is a geometric random variable with parameter \frac{1}{C}.

Proof. To obtain one value of X we need to iterate the algorithm a random number of times, let us call it N, until the value is accepted. That is we generate random variables Y_1, \cdots, Y_N until Y_N is accepted and then set X=Y_N.

Let us compute the CDF of Y_N.

We have, by conditioning on Y (i.e. P(A)=\int_y P(A|Y=y)g(y) dy for some event A).

\begin{align} P \{ Y_N \le x \} = P \left\{ Y \le x \,|\, U \le \frac{f(Y)}{C g(Y)}\right\} &= \frac{ P \left\{ Y \le x \,, U\le\frac{f(Y)}{C g(Y)}\right\} } {P\left\{U \le \frac{f(Y)}{C g(Y)}\right\}} \\ &= \frac{ \int_{-\infty}^{\infty} P \left\{ Y \le x \,, U \le \frac{f(Y)}{C g(Y)} \,|\, Y=y \right\} g(y) \, dy } {P \left\{ U \le \frac{f(Y)}{C g(Y)}\right\}} \\ &= \frac{ \int_{-\infty}^{x} P\left(U\le \frac{f(y)}{C g(y)}\right)g(y)\,dy} {P \left( U \le \frac{f(Y)}{C g(Y)}\right)} \\ &= \frac{ \int_{-\infty}^{x} \frac{f(y)}{C g(y)} g(y) \, dy } {P \left( U \le \frac{f(Y)}{C g(Y)}\right)} = \frac{ \int_{-\infty}^{x} f(y) \, dy } { C P\left(U \le \frac{f(Y)}{C g(Y)} \right)} \,. \end{align}

If we let x \to \infty we obtain that C P \left( U \le \frac{f(Y)}{C g(Y)}\right)=1 and thus, as desired, P(Y_N \le x ) \,=\, \int_{-\infty}^x f(y) \,dy \,. The above argument shows that at each iteration a value for X is accepted with probability P\left( U \le \frac{f(Y)}{C g(Y)} \right) = \frac{1}{C}, independently of the other iterations. Therefore the number of iterations needed is a geometric random variable with mean C.

Example

Suppose X has pdf f(x)=20x(1-x)^3 for 0 \le x\le 1. Since X is supported on [0,1] we pick Y uniform on [0,1] with g(y)=1. Then C = \max_{x\in [0,1]} \frac{f(x)}{g(x)} = \max_{x\in [0,1]} 20x(1-x)^3 = \frac{135}{64} So generate two random numbers U_1 and U_2 and if U_1 \le \frac{256}{27}U_2 (1-U_2)^3 set X=U_2. The average proportion of accepted values is \frac{64}{135}=0.474...

import numpy as np
import matplotlib.pyplot as plt
def accept_reject(N):   # Generate N samples
    n_accept=0
    x_list = [] 
    while n_accept < N:
        a=np.random.rand(2)
        if a[0] < (256/27)*a[1]*(1-a[1])**3 :
            n_accept += 1
            x_list.append(a[1])
    return x_list
plt.figure(figsize=(5,3))  
plt.hist(accept_reject(100000), bins=100, density=True)
t = np.arange(0., 1., 0.02)
plt.plot(t, 20*t*(1-t)**3, 'r--' )
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Histogram vs the exact pdf')
plt.show()  

Normal RV with the rejection method

  • We use the fact that we know how to generate an exponential random variable (we will take \lambda=1)
  • If X is a standard normal then \sigma X + \mu has mean \mu and variance \sigma^2 so it is enough to consider \mu=0 and \sigma^2=1.
  • If X is standard normal then, by symmetry, X and -X have the same distribution and the random variable Z=|X| has the pdf
    f_Z(x) = \frac{2}{\sqrt{2\pi}}e^{-\frac{x^2}{2}} \quad \textrm{ for } x\ge 0 \,.
  • Conversely we can write X = V Z where V is a Rademacher random variable, that is, V is the RV with pdf
    P(V= 1) = P(V= -1) = \frac{1}{2}. For example we can write V = 2 \lfloor 2U \rfloor -1 in terms of a random number U.
  • Bounding the pdf: \displaystyle\max_{x \in [0, \infty)} \frac{ \frac{2}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}}{e^{-x}} = \max_{x \in [0, \infty)} \sqrt{\frac{2e}{\pi}} e^{-\frac{(x-1)^2}{2}} = \sqrt{\frac{2e}{\pi}} \equiv C=1.3154

  • Algorithm: generate 3 random numbers U_1, U_2, U_3. If U_1 \le e^{-\frac{1}{2}(-\ln(U_2)-1)^2} set X= (2 \lfloor 2 U_3 \rfloor -1)(-\ln(U_2)), otherwise reject.
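The three steps above translate directly into code. A minimal sketch of this rejection sampler; the helper name normal_reject and the number of samples are just illustrative.

# Rejection sampler for a standard normal using an exponential(1) proposal
import numpy as np
def normal_reject():
    while True:
        U1, U2, U3 = np.random.rand(3)
        Y = -np.log(U2)                        # exponential(1) sample
        if U1 <= np.exp(-0.5 * (Y - 1)**2):    # accept with probability f(Y)/(C g(Y))
            return (2 * np.floor(2 * U3) - 1) * Y   # attach a random sign
np.array([normal_reject() for _ in range(10)])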

Box-Muller algorithm

  • Generate 2 independent standard normal using 2 random numbers.

  • Use the change of variable formula. Suppose X and Y are IID standard normal. Then for any (bounded) h \begin{aligned} E[ h(X,Y)] &= \int_{-\infty}^\infty \int_{-\infty}^\infty h(x,y) \frac{1}{2\pi} e^{-\frac{x^2 + y^2}{2}} \, dx \, dy \\ &= \int_0^{2\pi} \int_0^\infty h(r \cos(\theta), r \sin(\theta)) r e^{-\frac{r^2}{2}} dr \frac{1}{2\pi} d\theta \quad \quad x=r \cos(\theta), y=r \sin(\theta) \\ &= \int_0^{2\pi} \int_0^\infty h(\sqrt{s} \cos(\theta), \sqrt{s} \sin(\theta)) \frac{1}{2} e^{-\frac{s}{2}} ds \frac{1}{2\pi} d\theta \quad \quad s=r^2 \end{aligned}

  • So if S is exponential with parameter \frac{1}{2} and \Theta is uniform on [0,2\pi] then X = \sqrt{S}\cos(\Theta), \quad \quad Y= \sqrt{S} \sin(\Theta) are 2 independent standard normal random variables.

  • We have then a (rejection-free) algorithm: if U_1 and U_2 are two random numbers then X = \sqrt{-2 \ln(U_1)}\cos( 2 \pi U_2), \quad \quad Y= \sqrt{-2 \ln(U_1)}\sin( 2 \pi U_2) are 2 independent standard normal random variables.
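A minimal sketch of the Box-Muller recipe, vectorized over n pairs; the sample size n is illustrative.

# Box-Muller: 2n random numbers -> 2n independent standard normals
import numpy as np
n = 5
U1, U2 = np.random.rand(n), np.random.rand(n)
R = np.sqrt(-2 * np.log(U1))        # square root of an exponential(1/2) RV
X = R * np.cos(2 * np.pi * U2)
Y = R * np.sin(2 * np.pi * U2)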

Generating a random vector on the unit ball

  • Rejection method: the unit ball B_d=\{ x_1^2 + \cdots + x_d^2 \le 1\} is contained in the cube C_d =\{ \max_{i} |x_i|\le 1 \}

    • Generate {\bf Y}=(Y_1, Y_2, \cdots, Y_d) uniform on C_d (using d random numbers)

    • If {\bf Y} \in B_d, set {\bf X}={\bf Y}, otherwise reject and repeat.

    • We have
      \textrm{ Acceptance probability } = \frac{\textrm{volume of the ball}}{\textrm{volume of the cube}}= \frac{\frac{ \pi^{d/2}}{\Gamma(d/2+1)}}{2^d} \underbrace{=}_{d=2l} \frac{\frac{ \pi^{l}}{l!}}{2^{2l}} = \frac{1}{l!} \left(\frac{\pi}{4}\right)^l This is awful: for d=6 we have an acceptance probability of \frac{1}{6} \left(\frac{\pi}{4}\right)^3=0.0807, so about 92\% of vectors are rejected!

  • A rejection free method:

    • Generate a uniform distribution on the sphere S_{d-1}=\{ x_1^2 + \cdots + x_d^2 = 1\}: Take {\bf Y}=(Y_1, \cdots, Y_d) to be IID standard normal. The joint PDF, \propto e^{-\frac{1}{2}(x_1^2 + \cdots + x_d^2)}, is rotation invariant and so {\bf Z}=\frac{{\bf Y}}{\|{\bf Y}\|} is uniformly distributed on S_{d-1}.

    • If {\bf X} is uniformly distributed on the unit ball, then using spherical coordinates we have, for r\le 1, P(\|{\bf X}\|\le r)= \frac{\int_0^r s^{d-1}\, ds}{\int_0^1 s^{d-1}\, ds} = r^d.

    • Then we claim that if U is a random number then {\bf X}= U^{1/d}Z is uniformly distributed on the unit ball, since the distribution of {\bf X} is rotation invariant and for r\le 1 P(\|{\bf X}\|\le r) =P( U^{\frac1d} \|{\bf Z}\| \le r) =P(U \le r^d) =r^d\,.
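A minimal sketch of the rejection-free method; the dimension d is illustrative.

# Uniform point in the unit ball of R^d, without rejection
import numpy as np
d = 6
Y = np.random.randn(d)                  # d IID standard normals
Z = Y / np.linalg.norm(Y)               # uniform on the sphere S_{d-1}
X = np.random.rand()**(1/d) * Z         # radius U^{1/d} gives a uniform point in the ball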

Generating random permutations

  • Our goal is to generate a random permutation of the n elements (1,2, \cdots, n). We write P=(p(1), p(2), \cdots, p(n)) for such a permutation. There are n! such permutations, which makes enumeration impractical.

  • The following algorithm does the following: after k steps it creates a random permutation of k elements, so you can use it for permutations of any size.

    • Step 1: Set k=1

    • Step 2: Set p(1)=1

    • Step 3: If k=n stop. Otherwise set k=k+1

    • Step 4: Generate a random number U and set p(k) = p( \lceil kU \rceil) \quad \textrm{ and } \quad p(\lceil kU \rceil) = k then go to step 3.

  • Since \lceil kU \rceil takes values in \{1,2,\cdots, k\}, each with probability \frac{1}{k}, the algorithm assigns the value k randomly to one of the k positions and moves the value it replaces to the k^{th} position.

  • By induction, let us assume that after k-1 iteration we have obtained a random permutation of k-1 elements. Then by construction we have P \left( i_1, \cdots, i_{j-1}, k, i_{j+1}, \cdots, i_{k-1}, i\right)= P\left(i_1, \cdots, i_{j-1}, i , i_{j+1}, \cdots, i_{k-1}\right)\frac{1}{k} = \frac{1}{k!}

  • So we need, for example, 52 steps to create a perfectly shuffled deck of cards.
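A minimal sketch of the algorithm above; the helper name random_permutation is just illustrative, and positions are stored 0-indexed in the Python list.

# Random permutation of (1, ..., n), one element added per step
import numpy as np
def random_permutation(n):
    p = [1]                                            # steps 1-2
    for k in range(2, n + 1):                          # steps 3-4
        j = max(1, int(np.ceil(k * np.random.rand()))) # position in {1, ..., k}
        if j == k:
            p.append(k)
        else:
            p.append(p[j - 1])                         # displaced value goes to position k
            p[j - 1] = k                               # k goes to the random position
    return p
random_permutation(52)                                 # a shuffled deck of cards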

2 Monte-Carlo methods

Law of large numbers

We start by recalling some basic limit theorems from probability theory.

  • X_1, X_2, \cdots independent and identically distributed random variables (IID). Think of repeating a random experiment n times and recording the results. We denote by S_n = X_1 + X_2 + \cdots + X_n the sum.

  • The Law of Large numbers shows that if X_i have a finite mean \mu then the empirical average \frac{S_n}{n} converges to \mu.

Theorem 2.1 (Law of Large numbers) Suppose that X_1, X_2, \cdots are IID random variables with E[X_i]=\mu.

  • Weak LLN: The average \frac{S_n}{n} converges to \mu in probability: \textrm{For any } \varepsilon>0 \quad \lim_{n\to \infty} P\left( \left| \frac{S_n}{n}- \mu\right| \ge \varepsilon \right) =0\,.
  • Strong LLN: The average \frac{S_n}{n} converges to \mu almost surely. \lim_{n\to \infty}\frac{S_n}{n} = \mu \quad \textrm{ with probability } 1
  • Although the strong law of large numbers implies the weak law of large numbers (almost sure convergence implies convergence in probability), they each have their own practical use, as we shall see.

The Monte-Carlo method

  • The (simple) Monte-Carlo method is a probabilistic algorithm using sums of independent random variables and the law of large numbers to estimate a (deterministic) quantity \mu \in \mathbb{R} (or \mathbb{R}^d).
    The basic idea is to express \mu as the expectation of some random variable, \mu = E [ h(X) ], and then use the law of large numbers to build an estimator for \mu.

Theorem 2.2 (Simple Monte-Carlo Sampling Algorithm) To compute \mu \in \mathbb{R}

  • Find a random variable h(X) such that \mu= E[h(X)].

  • I_N = \frac{1}{N} \sum_{k=1}^N h(X_k) is an unbiased estimator for \mu, that is we have

    • For all N we have E[I_N] = \mu (unbiased).

    • With probability 1, \lim_{N \to \infty} I_N = \mu.

  • An interesting point is that there are, in general, many ways to choose the random variable h(X), and we will play with this, e.g. for importance sampling.

  • Conversely, in many problems the random variable h(X) is given but the expectation is too difficult to compute, so we rely on the LLN to compute \mu.

Buffon’s needles

  • This seems to be the first example of a rejection sampling used to solve a mathematical problem, by Le Comte de Buffon (see Bio in Wikipedia).

  • A needle of length l is thrown at random on a floor made of floorboards of width d, and we assume l \le d. We want to compute the probability that the needle intersects the line between two floorboards.

  • Denote by X the distance from the center of the needle to the nearest line (this is uniformly distributed on [0,\frac{d}{2}]) and by \Theta the acute angle between the needle and a horizontal line (this is uniformly distributed on [0,\frac{\pi}{2}]).

  • For the needle to intersect we must have x \le \frac{l}{2} \sin(\theta) and thus P\left(X \le \frac{l}{2} \sin(\Theta)\right) = \int_0^{\frac{\pi}{2}} \int_0^{\frac{l}{2}\sin(\theta)} \frac{2}{d} dx \, \frac{2}{\pi} d\theta = \frac{2l}{d\pi}

  • So in order to estimate \pi you throw N needles on the floor at random and
    \pi \approx \frac{2lN}{d} \frac{1}{ \# \textrm { of needles intersecting two floor boards}} No random number generator needed.


Computing \pi

  • Use the rejection method: Enclose a disk of radius 1 in a square of side length 2 and consider the following Bernoulli random variable X.

    • Generate 2 independent random variables V_1,V_2 uniformly distributed on [-1,1].

    • If V_1^2 + V_2^2 \le 1, set X=1, otherwise set X=0.

X is Bernoulli with p=\frac{\textrm{area of the disk}}{\textrm{ area of the square}}=\frac{\pi}{4} so \frac{4}{N}\sum_{k=1}^N X_k \to \pi with probability 1

import numpy as np
N = 1000000

dart_positions = 2 * np.random.rand(N, 2) - 1  # generates numbers in [-1, 1]
Ncircle = [0] # start the count with 0 to make our loop-logic easier

for x,y in dart_positions:
    if  x**2 + y**2 < 1:
        Ncircle.append(Ncircle[-1] + 1) # another dart has fallen in the circle
    else:
        Ncircle.append(Ncircle[-1])  # the dart fell outside of the circle - Ncircle is unchanged

running_estimate = []
for n_total, n_circle in enumerate(Ncircle[1:]): # skip the initial 0-count
    # n_total starts at 0, so we need to add 1
    running_estimate.append(4 * n_circle / (n_total + 1))

np.transpose(running_estimate[-20:])
array([3.14248771, 3.14248856, 3.14248942, 3.14249028, 3.14249114,
       3.14249199, 3.14249285, 3.14248971, 3.14248657, 3.14248742,
       3.14248828, 3.14248914, 3.14249   , 3.14249085, 3.14249171,
       3.14249257, 3.14249343, 3.14249028, 3.14249114, 3.142492  ])

Monte-Carlo to compute integrals

  • The goal is to compute integrals: \displaystyle I\,=\, \int_{S} h( {\bf x}) d {\bf x} for some function h:S\subset \mathbb{R}^d \to \mathbb{R}.
  • To solve an integral by quadrature with a grid with side length \varepsilon you need O\left(\frac{1}{\varepsilon^d}\right) grid points (“curse of dimensionality”).
  • Monte Carlo: Rewrite the integral as an expectation of some random variable you know how to simulate, simulate N of those, and then use the estimator \displaystyle I\,=\, \int_{S} h( {\bf x}) d {\bf x} = E[g(Y)] \approx I_N=\frac{1}{N}\sum_{k=1}^N g(Y_k)
  • For example, \displaystyle I_1 \,=\, \int_0^1 \frac{ e^{\sqrt x}-e^{\cos(x^3)}}{3 + \cos(x)} \, dx = E[h(U)] where U is a random number and h(x)=\frac{ e^{\sqrt x}-e^{\cos(x^3)}}{3 + \cos(x)}
  • In general pick a random variable with pdf f(x) which is non-zero on S and extend h to \mathbb{R}^d by setting it to zero outside S. Then
    I\,=\, \int_{S} h( {\bf x}) d {\bf x}\,=\, \int_{\mathbb{R}^d} \frac{h( {\bf x})}{f(\mathbf{x})}f(\mathbf{x}) d {\bf x} = E\left[ \frac{h( {\bf X})} {f( {\bf X})} \right] \quad X \textrm{ with pdf } f(x)
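A minimal sketch of the simple Monte-Carlo estimator for the first integral I_1 above; the sample size N is illustrative.

# Simple Monte-Carlo for I_1 = int_0^1 h(x) dx with h as above
import numpy as np
N = 10**5
U = np.random.rand(N)
h = (np.exp(np.sqrt(U)) - np.exp(np.cos(U**3))) / (3 + np.cos(U))
I_N = h.mean()
I_N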

Goals

  • Performance guarantees: How many samples should we generate to obtain a given precision for the computation of \mu = E[h(X)]?

    • Use central limit theorems type results to obtain asymptotic confidence intervals.

      • Simple and useful but somewhat unsatisfying.
    • Use concentration inequalities to obtain non-asymptotic confidence intervals.

      • True performance guarantees but harder to get and often too pessimistic.
  • How do we design better MC methods, since there are many ways…

Central limit theorem

  • IID random variables X_1, X_2, \cdots with \mu=E[X_i] and \sigma^2 = \textrm{Var}(X_i).
  • The central limit theorem tells us about the form of the fluctuations of the empirical average \frac{S_n}{n} around \mu as n \to \infty. Informally \frac{S_n}{n} \approx \mu + \frac{\sigma}{\sqrt{n}} Z \quad \textrm{ as } n\to \infty \quad \textrm{ with } Z \textrm { a standard normal RV.}
  • The meaning of \approx is convergence in distribution, made precise in the following theorem.

Theorem 2.3 (Central limit theorem) Let X_1, X_2, \cdots be a sequence of independent identically distributed random variables with mean E[X_1]=\mu and variance \textrm{Var}(X_i)=\sigma^2. Then for any - \infty \le a \le b \le \infty we have \lim_{N \to \infty} P \left( a \le \frac{ S_N - N \mu}{ \sqrt{N} \sigma} \le b \right) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-\frac{x^2}{2}} \, dx \,. \tag{2.1}

  • The CLT is very general (it holds for any distribution) and very easy to use.

  • But it is also asymptotic, true only as N\to \infty. It is not clear how large N should be for the asymptotic behavior to be accurate!

  • We will later build non-asymptotic confidence intervals.

Confidence intervals (version 1)

  • We build a confidence interval for the Monte-Carlo estimator I_N=\frac{1}{N}\sum_{k=1}^N h(X_k), since by the central limit theorem \frac{\sqrt{N}}{\sigma} ( I_N - \mu) is asymptotically normal.
  • To build an \alpha-confidence interval we let z_\alpha be the number defined by

\alpha = \frac{1}{\sqrt{2 \pi}} \int_{-z_\alpha }^{z_\alpha} e^{-\frac{x^2}{2}} \, dx \quad \quad \textrm{ for example } \quad \left\{\begin{array}{c} z_{.90}=1.645... \,\,(90\% \textrm{ confidence interval}) \\ z_{.95}=1.960... \,\,(95\% \textrm{ confidence interval}) \\ z_{.99}=2.576... \,\,(99\% \textrm{ confidence interval}) \end{array} \right.

  • By the CLT \displaystyle P\left( \mu \in \left[ I_N - z_\alpha \frac{\sigma}{\sqrt{N}}\,,\, I_N + z_\alpha \frac{\sigma}{\sqrt{N}} \right] \right) \longrightarrow \alpha \quad \textrm{ as } N \to \infty.

Approximate \alpha Confidence Interval P \left( \mu \in [I_N - \epsilon, I_N + \epsilon] \right) \gtrapprox \alpha \quad \textrm{ provided } \quad N \ge z_\alpha^2 \frac{\sigma^2}{ \epsilon^2}

  • The issue with this formula is that we are trying to compute \mu, so there is no reason to believe that the variance \sigma^2 is known! To remedy this issue we build an estimator for \sigma^2 from our samples X_1, X_2, \cdots.

Central limit theorem with Slutsky theorem

  • Building an estimator for the variance

Theorem 2.4 (Estimator for the variance) Given IID random variables X_1, \cdots, X_N with variance \textrm{Var}(X_i)=\sigma^2, V_N = \frac{1}{N-1} \sum_{k=1}^N \left(X_k -\frac{S_N}{N}\right)^2 \tag{2.2} is an unbiased estimator for \sigma^2, i.e. E[V_{N}]=\sigma^2 and \lim_{N \to \infty} V_N = \sigma^2 with probability 1.

Proof. Details in homework. It relies on the LLN and the following continuity properties: \frac{S_n}{n}\to \mu with prob. 1 and f continuous implies that f(\frac{S_n}{n}) \to f(\mu).

  • A theorem on weak convergence (Slutsky’s Theorem) implies the following useful version of CLT

Theorem 2.5 (CLT + Slutsky’s Theorem) Suppose X_1, X_2, \cdots are IID random variables with E[X_i]=\mu and \textrm{Var}(X_i)=\sigma^2. Then for any - \infty \le a \le b \le \infty we have \lim_{N \to \infty} P \left( a \le \frac{ S_N - N \mu}{ \sqrt{N V_N}} \le b \right) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-\frac{x^2}{2}} \, dx \,. \tag{2.3}

Confidence intervals (version 2)

  • We can get an asymptotic confidence interval using Theorem 2.5

P\left( \mu \in \left[ I_N - z_\alpha \sqrt{\frac{V_N}{N}}\,,\, I_N + z_\alpha \sqrt{\frac{V_N}{N}} \right] \right) \longrightarrow \alpha \quad \textrm{ as } N\to\infty\,, or

Approximate \alpha Confidence Interval P \left( \mu \in [I_N - \epsilon, I_N + \epsilon] \right) \gtrapprox \alpha \quad \textrm{ provided } \quad N \ge z_\alpha^2 \frac{V_N}{ \epsilon^2}

  • Take home message.

    • Theory: build the Monte-Carlo estimator with the smallest possible variance \longrightarrow Variance Reduction

    • For a precision \varepsilon one needs N \sim \frac{1}{\varepsilon^2} samples.

    • Practice: With your code always compute both I_N and V_N!
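A minimal sketch of this recipe, using an integrand with known mean (E[4\sqrt{1-U^2}]=\pi) so the answer can be checked; the sample size and the 95% level are illustrative.

# Always report I_N together with V_N: an asymptotic confidence interval
import numpy as np
N = 10**5
U = np.random.rand(N)
h = 4 * np.sqrt(1 - U**2)              # E[h(U)] = pi
I_N = h.mean()                         # Monte-Carlo estimator
V_N = h.var(ddof=1)                    # unbiased variance estimator
half_width = 1.96 * np.sqrt(V_N / N)   # z_{.95} = 1.960
(I_N - half_width, I_N + half_width)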

Examples

  • Compute \pi with accept/reject. Estimator I_N=\frac{4}{N} \sum_{k=1}^N {\bf 1}_{\{V_1^2 + V_2^2 \le 1\}} where V_1, V_2 are uniform on [-1,1] (drawn independently for each of the N samples), so we get \textrm{Var}(I_N) \,=\, \frac{16}{N}\frac{\pi}{4}\left(1-\frac{\pi}{4}\right)= \frac{\pi(4-\pi)}{N} = \frac{2.6967}{N} To get \pi with a precision of \varepsilon=10^{-3} and 99% confidence we need at least N = 2.576^2 \times 2.6967 \times 10^6 \approx 1.8 \times 10^7 samples.

  • Let h(x)= \sqrt{1 - x^2} for x \in [0,1]. Then \int_0^1 h(x) dx = \frac{\pi}{4} and an estimator for \pi is I_N=\frac{4}{N} \sum_{k=1}^N h(U_k) with variance \textrm{Var}(I_N) \,=\, \frac{16}{N} \left( E[h(U)^2] -E[h(U)]^2 \right)= \frac{1}{N}\left(\frac{32}{3}-\pi^2\right)= \frac{0.797}{N} Better: roughly 3.4 times fewer samples (and only half as many random numbers).

3 Variance reduction techniques

We discuss two examples of how to reduce the variance of an algorithm. This is a vast subject and there are many other techniques to perform variance reduction, e.g. conditional MC, control variates, stratified sampling, etc. Some of them will appear in the exercises. Recommended readings are (Ross 2019), (Ross 2013), (Rubinstein and Kroese 2017) where you will find much more information.

Antithetic random variables and common random numbers

  • The idea is to use dependent random variables or dependent random numbers to decrease the variance. Recall the formula for variance: for random variables X and Y we have \textrm{Var}(X+Y) = \textrm{Var}(X)+ \textrm{Var}(Y)+ 2 \textrm{Cov}(X,Y) \,. If X and Y are negatively correlated, i.e. \textrm{Cov}(X,Y) < 0 then we can possibly reduce the variance.
  • To build up our intuition consider the random variables U and V=1-U, they are identically distributed and negatively correlated because if U \le \frac{1}{2} then V \ge \frac{1}{2} and vice versa: \left(U - \frac{1}{2}\right)\left((1-U)-\frac{1}{2}\right) = U(1-U) - \frac{1}{4} \le 0 \quad \textrm{ since } \max_{x} x(1-x) = \frac{1}{4}
  • The key idea for the general case is contained in the following theorem

Theorem 3.1 Suppose X_1, \cdots, X_n are independent random variables and f and g are increasing functions of n variables. Then \textrm{Cov}( f(X_1, \cdots X_n) , g(X_1, \cdots X_n)) \ge 0 \,. \tag{3.1}

Proof. By induction. For n=1, since f and g are increasing, for any x,y we have 0 \le (f(x)-f(y))(g(x)-g(y)) and thus for any random variables X, Y 0 \le E [ (f(X)-f(Y))(g(X)-g(Y)) ] = E [f(X)g(X)]+ E[ f(Y)g(Y) ] - E [f(X)g(Y)] - E [f(Y)g(X)] If, in addition, X and Y are IID we get 0 \le 2 E[f(X) g(X)] - 2 E[f(X)] E[g(X)] = 2 \textrm{Cov}(f(X),g(X)) which proves Equation 3.1 for the case n=1.

Assuming now that Equation 3.1 is true for n-1 we condition on the value of X_n and

\begin{aligned} E[ f( X_1, \cdots X_n) g (X_1, \cdots X_n) \vert X_n= x_n] & = E[ f(X_1, \cdots, X_{n-1}, x_n ) g ( X_1, \cdots X_{n-1}, x_n)] \quad \textrm {by independence} \\ & \ge E[ f(X_1, \cdots, X_{n-1}, x_n )] E[ g ( X_1, \cdots X_{n-1}, x_n)] \quad \textrm {by induction} \\ &= E[ f(X_1, \cdots, X_n )\vert X_n= x_n] E[ g ( X_1, \cdots, X_n)\vert X_n= x_n] \end{aligned}

Taking expectation on both sides one finds E[ f( X_1, \cdots, X_n) g (X_1, \cdots, X_n)] \ge E\big[ E[ f( X_1, \cdots X_n) \vert X_n] E[ g (X_1, \cdots X_n) \vert X_n] \big] Note that the functions E[ f( X_1, \cdots X_n) \vert X_n] and E[ g( X_1, \cdots X_n) \vert X_n] are increasing in X_n, so Equation 3.1 for n=1,

\begin{aligned} E\big[ E[ f( X_1, \cdots X_n) \vert X_n] E[ g (X_1, \cdots X_n) \vert X_n] \big] & \ge E\big[ E[ f( X_1, \cdots X_n) \vert X_n] \big] E\big [E[ g (X_1, \cdots X_n) \vert X_n] \big ] \\ & = E[ f( X_1, \cdots X_n)] E[g( X_1, \cdots X_n)] \end{aligned} and this proves the theorem.

A simple example

  • Compute \mu = E\left[ e^U \right] using MC (of course the answer is e-1…).

  • Since h(x)=e^x is increasing and g(x)=e^{1-x} is decreasing we can use antithetic RV.

  • We have the covariance \textrm{Cov}\left(e^U, e^{1-U}\right)=E\left[e^U e^{1-U}\right]-E\left[e^U\right] E\left[e^{1-U}\right]=e -(e-1)^2=-0.2342

  • We also have \textrm{Var}\left(e^U\right)= E\left[e^{2U}\right]- E\left[e^U\right]^2 = \frac{e^{2}-1}{2}-(e-1)^2=0.2420

  • To compare:

    • Independent random numbers U_1,U_2 give the variance \textrm{Var}\left( \frac{e^{U_1}+e^{U_2}}{2} \right) = \frac{ \textrm{Var}\left(e^{U}\right)}{2}= 0.1210

    • Antithetic random variables U, 1-U give the variance \textrm{Var}\left( \frac{e^{U}+e^{1-U}}{2} \right) = \frac{ \textrm{Var}\left(e^{U}\right)}{2} + \frac{\textrm{Cov}\left(e^U, e^{1-U}\right) }{2}=0.0039

  • A variance reduction of 96.7 percent (not bad…)
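A minimal sketch comparing the two estimators of E[e^U]; the sample size is illustrative.

# Antithetic variables for E[e^U]: pair each U with 1-U
import numpy as np
N = 10**5
U = np.random.rand(N)
plain = 0.5 * (np.exp(np.random.rand(N)) + np.exp(np.random.rand(N)))  # 2N independent numbers
anti = 0.5 * (np.exp(U) + np.exp(1 - U))                               # N numbers, reused
(plain.var(ddof=1), anti.var(ddof=1))   # approximately 0.121 vs 0.0039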

Reliability function

  • A system consists of n components, each of which either works or fails. We set x_i= \left\{ \begin{array}{cl} 1 & \textrm{ if component } i \textrm { works} \\ 0 & \textrm{ if component } i \textrm { fails} \end{array} \right. \,. We call \mathbf{x}=(x_1, \cdots, x_n) the state vector describing the state of the system.
  • A structure function h(x_1, \cdots, x_n) describes the state of the system
    h(\mathbf{x})= \left\{ \begin{array}{cl} 1 & \textrm{ if the system is functioning} \\ 0 & \textrm{ if the system fails} \end{array} \right. \,. It is natural to assume that h is an increasing function of the x_i (turning a component on cannot make the system fail).
  • Probabilistic description: assume that the components are independent and are described by Bernoulli random variables X_i with p_i = P( X_i=1 ) = P( \textrm{ component } i \textrm{ works}) = 1- P( \textrm{ component } i \textrm{ fails})

  • The RV h(\mathbf{X})=h(X_1, \cdots, X_n) is also Bernoulli and we have E[h(\mathbf{X})] = P(\textrm{ system works }), and we may want to use Monte-Carlo to estimate it.

Examples of systems

  • The series structure: h(x_1,\cdots,x_n)=\min_{i} x_i (all components must work)

  • The parallel structure: h(x_1,\cdots,x_n)=\max_{i} x_i (at least one component must work)

  • The k-out-of-n structure: h(x_1,\cdots,x_n)= \left\{ \begin{array}{cl} 1 &\textrm{if }\sum_{i=1}^n x_i\ge k\\0 & \textrm{otherwise} \end{array} \right. (at least k components must work)

  • The bridge structure: h(x_1,x_2,\cdots, x_5) = \max \{ x_1 x_3 x_5, x_2 x_3 x_4,x_1 x_4, x_2 x_5 \}

Figure 3.1: Examples of structures (series, parallel, 2-out-of-3, bridge)

Structure function

A general, principled way to compute the structure function:

  • A path A is a subset A \subset \{1,\cdots, n\} such that h(\mathbf{x})=1 if x_i=1 for all i\in A, that is, it is a subset of components which will ensure that the system works.

  • A minimal path A is a path such that no proper subset of A is itself a path, that is, it is a minimal subset of components which will ensure that the system works.

Theorem 3.2 Using the notation x_A = \prod_{i\in A} x_i for any subset A \subset \{1,\cdots,n\}, the structure function for an arbitrary system is given by

\begin{aligned} h(\mathbf{x}) & = \max_{A \,\,\textrm{minimal path}} x_A \\ & = 1 - \prod_{A \textrm{ minimal path}} (1 - x_A) \end{aligned}

Proof. For the system to work we need to have all components working for at least one minimal path A and so x_A=1. If this is the case \max_{A \,\,\textrm{minimal path}} x_A=1 and \prod_{A \textrm{ minimal path}} (1 - x_A)=0. \quad \blacksquare

The theorem is of limited usability since it requires enumeration of all minimal paths. If the system is moderately big, this is very difficult and/or tedious.

Monte-Carlo for structure functions

  • We assume a probabilistic model, which is pretty reasonable: think, for example, of a routing network where each node has a certain probability of failure.

MC for system reliability: Consider a system with structure function h(\mathbf{x}) and assume the components X_k, k=1,\cdots, n are independent Bernoulli RVs with P(X_k=1)=p_k. To compute E[h(\mathbf{X})]=P(\textrm{system works}):

  1. Generate nN independent random numbers U_{ik} with 1\le i \le N and 1\le k \le n and set X_{ik}=\mathbf{1}_{U_{ik}\le p_k}.

  2. Evaluate h(\mathbf{X_i}) where \mathbf{X_i}=(X_{i1}, \cdots, X_{in}).

  3. Then I_N \,=\, \frac{1}{N}\sum_{i=1}^N h(\mathbf{X_i}) is an estimator for P(\textrm{system works}).

It is important to note that it is often easier to evaluate h( {\bf X}) for a given realization (that is, to check whether the system functions, for which well-known algorithms can be used) than to compute the structure function explicitly.

  • The reliability function g(U_1, \cdots, U_n)= h\left( \mathbf{1}_{\{U_1\le p_1\}}, \cdots, \mathbf{1}_{\{U_n\le p_n\}} \right) is an increasing function so \textrm{Cov}(g(U_1, \cdots, U_n),g(1-U_1, \cdots, 1-U_n) ) <0 and we can reuse each random number and reduce the variance.
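As an illustration, here is a minimal sketch for the bridge structure of Figure 3.1, reusing each random number through 1-U as suggested above; the component reliabilities p and the sample size are only illustrative assumptions.

# MC estimate of P(system works) for the bridge structure, with antithetic numbers
import numpy as np
def h_bridge(x):   # structure function of the 5-component bridge (x has shape (5, N))
    return np.maximum.reduce([x[0]*x[2]*x[4], x[1]*x[2]*x[3], x[0]*x[3], x[1]*x[4]])
p = np.array([0.9, 0.9, 0.9, 0.9, 0.9])
N = 10**5
U = np.random.rand(N, 5)
g = h_bridge((U <= p).astype(float).T)              # component k works when U_k <= p_k
g_anti = h_bridge(((1 - U) <= p).astype(float).T)   # same random numbers, antithetic
estimate = (0.5 * (g + g_anti)).mean()
estimate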

Importance sampling

  • Consider continuous RVs, but the same ideas work for discrete RVs. Suppose X has pdf f(x) and we want to compute \mu= E[h(X)] = \int h(x) f(x) dx. One can always use the simple MC algorithm I_N =\frac{1}{N} \sum_{k=1}^N h(X_k).
  • The idea behind importance sampling is to sample from another random variable Y with density g and write \mu= E[h(X)] = \int h(x) f(x) dx = \int h(x) \frac{f(x)}{g(x)} g(x) dx = E\left[ \frac{h(Y) f(Y)}{g(Y)}\right] \,. For the integral to make sense we need to pick g such that g(x) = 0 is allowed only if f(x)=0 or h(x)f(x) =0.
  • Often in practice if X belongs to some family, say an exponential RV, we will pick Y from the same family, say an exponential or a gamma random variable, but with other parameters which we should pick carefully.

Importance sampling Monte-Carlo estimator: To estimate \mu = E\left[ h(X) \right] the importance sampling algorithm uses the random variable Y with pdf g and has the estimator J_N = \frac{1}{N} \sum_{k=1}^{N} \frac{h(Y_k) f(Y_k)}{g(Y_k)}

  • Notice the similarity (in spirit) with rejection method.

  • How to pick g?
  • The variance of the simple MC estimator I_N is \textrm{Var}(I_N) = \frac{1}{N} ( E[ h(X)^2] - E[h(X)]^2).
  • For the importance sampling estimator J_N we have \begin{aligned} \textrm{Var}(J_N) =\frac{1}{N} \textrm{Var}\left( \frac{ h(Y) f(Y)}{g(Y)} \right) &=\frac{1}{N} \left[ \int \frac{ h(x)^2 f(x)^2}{g^2(x)} g(x) \, dx - \left( \int h(x) \frac{f(x)}{g(x)} g(x) dx \right)^2 \right] \\ &= \frac{1}{N}\left( E \left[ h(X)^2 \frac{f(X)}{g(X)}\right] - E[h(X)]^2 \right) \end{aligned}
  • We need E\left[ h(X)^2 \frac{f(X)}{g(X)}\right]\le E \left[ h(X)^2\right] \iff \frac{f(x)}{g(x)} \textrm{ "small" when } h(x) \textrm{ is "big"}, that is we want to sample more (with Y and pdf g than with X and pdf f) in the region where h is big. Hence the name importance sampling.
  • The next result shows the best choice of g, even if it is practically not very useful since it requires the knowledge of what we are trying to compute in the first place (either \mu=E[h(X)] or E[|h(X)|]).

Theorem 3.3 (Optimal importance sampling)  

  1. If h(x)\ge 0 then the optimal importance sampling distribution is \displaystyle g_*(x) \,=\, \frac{h(x) f(x)}{\mu} = \frac{h(x) f(x)}{E[h(X)]}\,, and the corresponding importance sampling estimator J_N has 0 variance.

  2. For general h the optimal importance sampling distribution is \displaystyle g_*(x) \,=\, \frac{|h(x)| f(x)}{E[ |h|(X)] }.

Proof.

  • If h(x) \ge 0 then with g_*(x) \,=\, \frac{h(x) f(x)}{\mu} we have \int g_*(x) dx =1 and N \textrm{Var} (J_N) = E \left[ h(X)^2 \frac{f(X)}{g_*(X)}\right] - E[h(X)]^2 = E \left[ \frac{h(X)^2 f(X) \mu}{ h(X) f(X) }\right]- E[h(X)]^2 \,=\, 0 \,. and thus it is optimal.

  • For general h the optimal variance does not vanish but using the Cauchy-Schwarz inequality in the form E[Z]^2 \le E[Z^2] we obtain for any choice of g, \begin{aligned} E \left[ h(X)^2 \frac{f(X)}{g_*(X)}\right] & = E \left[ \frac{ h(X)^2 f(X)}{|h|(X) f(X)} \right] E[ |h|(X)] = E \left[ |h|(X) \right]^2 = E \left[ |h|(Y) \frac{ f(Y) }{g(Y)} \right]^2 \\ & \le E \left[ h(Y)^2 \frac{f(Y)^2}{g(Y)^2}\right] = E \left[ h(X)^2 \frac{f(X)}{g(X)}\right] \end{aligned} and thus g_* gives the optimal variance.

Examples

  • Suppose we want to estimate the mean of a standard normal RV X using an importance sampling estimator (it may look like a trivial example but wait…).
    So we take h(x)=x and let us pick Y to be a normal RV with mean 0 and variance \sigma^2: E\left[h(X)^2 \frac{f(X)}{g(X)}\right] = \int_{-\infty}^\infty x^2 \frac{ (e^{-x^2/2}/\sqrt{2 \pi})^2}{e^{-x^2/(2\sigma^2)}/(\sqrt{2 \pi} \sigma) } \, dx = \sigma \int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi}} x^2e^{-x^2( 2 - \sigma^{-2})/2} \, dx = \left\{ \begin{array}{cl} \frac{\sigma}{(2-\sigma^{-2})^{3/2}} & \textrm{ if } \sigma^2 > 1/2 \\ +\infty & \textrm{otherwise} \end{array} \right. An elementary computation shows that the optimal choice is \sigma^2 = 2.
  • Suppose X is exponential with parameter \lambda and h = {\bf 1}_{[a,\infty)} so that we want to estimate \mu= E[ h(X)] = P( X \ge a) = e^{-\lambda a}\,. Note e^{-\lambda a} may be very small, e.g. 10^{-5}. The variance is e^{-\lambda a}(1- e^{-\lambda a}) \approx e^{-\lambda a} and so we need the precision to be at least \varepsilon=10^{-6} to have a meaningful estimate. We then need N = O\left(\frac{\sigma^2}{\epsilon^2}\right)= O(10^7)….
    The optimal importance sampling is g_*(x) = \frac{h(x) f(x)}{ e^{-\lambda a}} = 1_{[a,\infty)}(x)\, \lambda e^{-\lambda(x-a)} which is a shifted exponential random variable. Note that if we sample from X we will mostly obtain h(X)=0 and most samples are “useless”, while if we sample from the importance sampling distribution every single sample contributes a non-zero term to the estimator.
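For this example the optimal density g_* can be sampled directly (a shifted exponential), which makes the zero-variance claim of Theorem 3.3 easy to check. A minimal sketch; the values of \lambda, a and N are illustrative.

# Importance sampling for P(X >= a) with X exponential(lambda), sampling from g_*
import numpy as np
lam, a, N = 1.0, 10.0, 10**4
Y = a + np.random.exponential(1/lam, N)               # samples from g_*(x) = lambda e^{-lambda(x-a)}, x >= a
weights = np.exp(-lam * Y) / np.exp(-lam * (Y - a))   # f(Y)/g_*(Y); h(Y)=1 since Y >= a
J_N = weights.mean()                                  # equals e^{-lambda a} exactly: zero variance
J_N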

Tilted distribution

  • Tilted distributions are a natural tool for importance sampling. If f(x) is the pdf of X consider the random variable X^{(t)} with the new distribution f_t(x) = \frac{e^{tx}}{M(t)}f(x) \quad \quad \textrm{ where } M(t)=E[e^{tX}]=\int e^{tx} f(x) dx that is M(t) is the moment generating function of the random variable X. We have M'(t)=E[Xe^{tX}], M''(t)=E[X^2e^{tX}], \cdots
  • Example: If f(x)=\lambda e^{-\lambda x} (exponential distribution) then M(t)=\frac{\lambda}{\lambda -t} for t < \lambda, and thus f_t(x)=(\lambda-t) e^{-(\lambda-t)x}\,, so X^{(t)} has an exponential distribution with parameter \lambda -t.
  • Example: If p(k)=p^k(1-p)^{1-k}, k=0,1 is a Bernoulli distribution then M(t)=pe^{t} + (1-p) and thus p_t(k)= \frac{e^{tk}}{M(t)}p^k(1-p)^{1-k} = \left(\frac{pe^{t}}{pe^{t} + (1-p)}\right)^k \left(\frac{1-p}{pe^{t} + (1-p)}\right)^{1-k}. X^{(t)} is again a Bernoulli distribution with success probability p_t \equiv \frac{pe^{t}}{pe^{t} + (1-p)}
  • and so on

Rare event simulation

  • Suppose X_1, \cdots X_n are IID random variables with joint density f(x_1)f(x_2)\cdots f(x_n) and let S_n=X_1+\cdots+ X_n be the sum. We want to estimate the (high-dimensional) integral P\left( \frac{S_n}{n} \ge a\right) = E\left[1_{\{S_n \ge n a\}}\right] = \int 1_{\{S_n \ge na\}} f(x_1)f(x_2)\cdots f(x_n) dx_1 \cdots dx_n If a > E[X_1] this probability is very small (exponentially small in n, see Chapter 4) and thus the simple Monte-Carlo will fail miserably.
  • Tilting the measure to do importance sampling we now have to sample the random variable 1_{\{S_n \ge na\}} \prod_{i=1}^n \frac{f(X^{(t)}_i)}{f_t(X^{(t)}_i)} = 1_{\{S_n \ge na\}} \prod_{i=1}^n \frac{M(t)}{e^{t X^{(t)}_i}} = 1_{\{S_n \ge na\}} e^{-tS_n} M(t)^n \le \left( M(t)e^{-ta}\right)^n
  • It makes sense to make this random variable as small as possible thus minimize the bound M(t)e^{-ta}. Differentiating we find M'(t)e^{-ta} - a M(t) e^{-ta}=0 or a = \int x \frac{e^{tx}}{M(t)} f(x) dx = E[X_t], \quad X_t \textrm{ with pdf } f_t(x) \tag{3.2} That is we tilt the probability f so that a is the mean of the tilted pdf. So when sampling X_t we should expect to see often values around a making the estimation much easier!

  • Numerical example:

    • X_i is Bernoulli with p=0.4, and we take n=20 and a=0.8.

    • Since E[X]=p for a Bernoulli random variable, the equation for the optimal t, Equation 3.2, gives
      \frac{pe^{t}}{pe^{t} + (1-p)} = a \quad \iff e^{t}= \frac{a}{1-a}\frac{1-p}{p}

    • We find e^t=6 and M(t)=(0.4 e^t +0.6 )= 3 and thus 1_{\{S_{20} \ge 16\}} \prod_{i=1}^{20} \frac{p(X^{(t)}_i)}{p_t(X^{(t)}_i)} =1_{\{S_{20} \ge 16\}} \left(\frac{1}{6}\right)^{S_{20}} 3^{20} \le \left(\frac{1}{6}\right)^{16} 3^{20}= 0.001236

    • We can compute here explicitly that P(S_{20} \ge 16)=0.000317 so we have \textrm{ Simple MC: } \textrm{Var}\left( 1_{\{S_{20} \ge 16\}}\right)= 0.000317(1-0.000317)= 3.169 \times 10^{-4} On the other hand, for the importance sampling the new random variable we sample takes values between 0 and 0.001236 and so \textrm{ IS MC: } \textrm{Var}\left(1_{\{S_{20} \ge 16\}} \left(\frac{1}{6}\right)^{S_{20}} 3^{20}\right) \le \frac{(0.001236)^2}{4}= 3.81 \times 10^{-7} (We have used that 0\le Y\le a implies \textrm{Var}(Y) \le \frac{a^2}{4}, see Chapter 4 for a proof.)
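A minimal sketch of this importance sampling estimator, sampling the tilted Bernoulli distribution with success probability p_t=0.8 (so e^t=6, M(t)=3); the sample size is illustrative.

# Importance sampling estimate of P(S_20 >= 16) with tilted Bernoulli(0.8) samples
import numpy as np
n, N = 20, 10**5
p_t = 0.8                                       # tilted success probability
X = (np.random.rand(N, n) < p_t).astype(int)
S = X.sum(axis=1)
weights = (S >= 16) * (1/6)**S * 3.0**n         # indicator times likelihood ratio e^{-tS} M(t)^n
(weights.mean(), weights.var(ddof=1))           # approx 3.17e-4 and a much smaller variance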

A reliability example

Each edge fails with probability q=10^{-2} and the edges are independent.
\mathbf{X}=(X_1, \cdots, X_{22}) has pdf (product of Bernoulli) p(\mathbf{x}) \,\equiv\, P( {\bf X}= {\bf x}) \,=\, \prod_{i=1}^{22} q^{x_i} (1-q)^{1 - x_i} \,=\, q^{|{\bf x}|} (1-q)^{22 - |{\bf x}|} with |x|=\sum_{i=1}^{22}x_i.
h(\mathbf{x})=1 if there is no path from 1 to 22 along working edges and we want \mu=E[h(X)] (failure probability)

Figure 3.2: Connecting node 1 to node 22
  • At least three edges must fail for the system to fail so we have the bound \mu \le P(|X| \ge 3) = 1 - \sum_{j=0}^{2} { 22 \choose j} q^j (1-q)^{22-j} = 0.00136 \,.
  • We get a lower bound for \mu by noting that \mu \ge P( e_1, e_2, e_3 {\rm ~fail}) = q^3 \,=\, 10^{-6}.

  • So the failure probability is between 10^{-6} and about 10^{-3}.

  • Use importance sampling with Y with q(\mathbf{y}) \,\equiv\, P( {\bf Y}= {\bf y}) \,=\, \prod_{i=1}^{22} \theta^{y_i} (1-\theta)^{1 - y_i} \,=\, \theta^{|{\bf y}|} (1-\theta)^{22 - |{\bf y}|}.
  • How to pick \theta?
  • Since we need at least 3 non-working edges we may want to pick \theta such that the average number of non-working edges is 3 for \mathbf{Y}, that is we choose \theta=\frac{3}{22} (since |Y| is a binomial random variable, E[|Y|]= 22 \theta).
  • IS estimator J_N = \frac{1}{N} \sum_{i=1}^{N} \frac{ h(\mathbf{Y}_i) p(\mathbf{Y}_i)}{q(\mathbf{Y}_i)} with variance \textrm{Var}(J_N) = \frac{1}{N} \left( \sum_{\mathbf{y}} \frac{ h(\mathbf{y} )^2 p(\mathbf{y})^2}{q(\mathbf{y})^2} q(\mathbf{y}) -\mu^2 \right) = \frac{1}{N} \left( \sum_{ \mathbf{y}: h({\bf y})=1} \frac{ p(\mathbf{y})}{q(\mathbf{y})} p(\mathbf{y}) -\mu^2 \right) \tag{3.3} Note that \frac{ p(\mathbf{y}) }{q(\mathbf{y})} \,=\, \frac{q^{| \mathbf{y}|}(1-q)^{22- |\mathbf{y}|}}{\theta^{|\mathbf{y}|}(1-\theta)^{22- |\mathbf{y}|}} = \left( \frac{ 1-q}{1-\theta}\right)^{22} \left(\frac{ q (1-\theta) }{\theta(1-q)} \right)^{|\mathbf{y}|}= 20.2 \times (0.064)^{|{\bf y}|} In Equation 3.3 all terms with h({\bf y})=1 have |{\bf y}|\ge 3 and for those {\bf y} we have \frac{ p({\bf y})}{q({\bf y})}\,\le\, 20.2 \times (0.064)^{3} \le 0.0053. So we get \textrm{Var}(J_N) \,\le\, \frac{1}{N} \sum_{{\bf y}\,:\, h({\bf y})=1} 0.0053 \,\,p({\bf y}) = \frac{ 0.0053\,\mu}{N} This reduces the variance by a factor of approximately 200, so we need about 200 times fewer samples for the same precision (equivalently, for the same N the statistical error is about \sqrt{200} \approx 14 times smaller).

4 Concentration inequalities and non-asymptotic confidence intervals

We discuss in this section a few basic concentration inequalities, Chernov bounds, Hoeffding bounds and McDiarmid bounds, and give some applications. Concentration inequalities are very important tools to provide performance guarantees for Monte-Carlo methods, but also in a variety of other contexts in Uncertainty Quantification and Machine Learning. For further reading we recommend the excellent book (Boucheron, Lugosi, and Massart 2013).

Chernov bounds

We start with a basic but fundamental bound

Theorem 4.1 (Markov and Chernov inequalities)  

  1. Markov inequality: Suppose Y\ge 0 is a non-negative random variable and a>0 a positive number. Then P(Y \ge a) \le \frac{E[Y]}{a}\,. \tag{4.1}

  2. Chernov inequality. Suppose X is a random variable and a \in \mathbb{R} then \begin{aligned} & P(X \ge a) \le \inf_{t \ge 0} \frac{E[e^{tX}]}{e^{ta}} \\ & P(X \le a) \le \inf_{t \le 0} \frac{E[e^{tX}]}{e^{ta}} \end{aligned} \tag{4.2}

Proof. If Y\ge 0 then \displaystyle Y=Y 1_{\{Y\ge a\}} + Y1_{\{Y< a\}} \ge Y1_{\{Y\ge a\}} \ge a 1_{\{Y\ge a\}} and thus \displaystyle E[Y] \ge a P(Y\ge a) which is Equation 4.1.
Using Markov inequality we have for t \ge 0 P(X \ge a) = P\left( e^{tX} \ge e^{ta} \right) \le \frac{E[e^{tX}]}{e^{ta}} \,. Optimizing over t gives the first inequality in Equation 4.2. The second inequality is obtained by replacing X by -X.

Examples of Chernov bounds

  • If X is normal with mean \mu and variance \sigma^2, then M(t)=e^{\mu t + \frac{\sigma^2 t^2}{2}} and, for a = \mu + \varepsilon, Chernov gives P(X \ge \mu + \varepsilon) \le \inf_{t \ge 0} \frac{e^{\mu t + \frac{\sigma^2 t^2}{2}} }{e^{t(\mu + \varepsilon)}} = e^{ -\sup_{t\ge 0}\left\{ t\varepsilon - \frac{\sigma^2 t^2}{2}\right\} } = e^{-\frac{\varepsilon^2}{2\sigma^2}} This is sharp: one can show (see exercises) that \left(\frac{\sigma}{\varepsilon}-\frac{\sigma^3}{\varepsilon^3} \right)\frac{1}{\sqrt{2\pi}} e^{-\frac{\varepsilon^2}{2\sigma^2}} \le P(X \ge \mu + \varepsilon) \le \frac{\sigma}{\varepsilon}\frac{1} {\sqrt{2\pi}} e^{-\frac{\varepsilon^2}{2\sigma^2}}
  • Consider X_n binomial with parameters (n,p); then M(t)=\left(pe^{t}+(1-p)\right)^n. We will use that

\sup_{t} \left\{ tz - \ln(pe^t + (1-p)) \right\} = \left\{ \begin{array}{cl} + \infty & \textrm{ if } z<0 \textrm{ or } z>1 \\ z \ln \frac{z}{p} + (1-z) \ln \frac{1-z}{1-p} := H(z||p) & \textrm{ if } 0 \le z \le 1 \end{array} \right. H(z||p)=z \ln \frac{z}{p} + (1-z) \ln \frac{1-z}{1-p} is the KL-divergence (or relative entropy) between Bernoulli distributions with parameters z and p respectively.

Chernov bound gives P\left(X_n \ge n(p+\varepsilon)\right) \le \inf_{t >0} \frac{(pe^t + (1-p))^n}{e^{tn(p+\varepsilon)}}= e^{-n \sup_{t\ge 0}\left\{ t(p+\epsilon)-\ln(pe^t+(1-p)) \right\}} = e^{-n H( p+\varepsilon||p)} and similarly P\left(X_n \le n(p-\varepsilon)\right) \le e^{-n H( p-\varepsilon||p)}

Hoeffding’s bound

  • Chernov bounds are very sharp but require knowledge of the mgf M_X(t).
  • One of the main ideas behind concentration inequalities: given X, bound M_X(t) \le M(t) by the mgf of a random variable Y which you know explicitly. Mostly here we take Y Gaussian, but one also uses others: Bernoulli, Poisson, Gamma, etc.
  • The following elementary bound will be used repeatedly.

Lemma 4.1 (Hoeffding’s bound) Suppose a \le X \le b with probability 1. Then for any \varepsilon > 0

  1. Bound on the variance \displaystyle \textrm{Var}(X) \le \frac{(b-a)^2}{4}

  2. Bound on the mgf \displaystyle E\left[ e^{tX} \right] \le e^{t E[X]} e^{\frac{t^2 (b-a)^2}{8}}

Proof. For the bound on the variance, a \le X \le b implies that \displaystyle -\frac{b-a}{2} \le X - \frac{a+b}{2} \le \frac{b-a}{2} and therefore \textrm{Var}(X) = \textrm{Var}\left(X - \frac{a+b}{2}\right) \le E\left[\left(X - \frac{a+b}{2}\right)^2\right] \le \frac{(b-a)^2}{4}\,.

Since X is bounded, the moment generating function M(t) = E[e^{tX}] exists for any t \in \mathbb{R}. To bound M(t) let us consider instead its logarithm u(t)=\ln M(t). We have \begin{aligned} u'(t) &= \frac{M'(t)}{M(t)} &=& E\left[ X \frac{e^{tX}}{E[e^{tX}]} \right] \\ u''(t) &= \frac{M''(t)}{M(t)}- \left(\frac{M'(t)}{M(t)}\right)^2 &=& E\left[ X^2 \frac{e^{tX}}{E[e^{tX}]} \right] - E\left[ X \frac{e^{tX}}{E[e^{tX}]} \right]^{2} \end{aligned} We recognize u''(t) as the variance under the tilted measure (the distribution of X reweighted by \frac{e^{tX}}{E[e^{tX}]}) and thus by part 1. (applied to the tilted measure) we have u''(t) \le \frac{(b-a)^2}{4}.
Using the Taylor expansion with remainder we have, for some \xi between 0 and t \ln M(t)= u(t) = u(0) + u'(0)t + u''(\xi) \frac{t^2}{2} \le t E[X] + \frac{t^2(b-a)^2}{8}\,. This concludes the proof.

Remark: The bound on the variance in 1. is optimal. Indeed, taking without loss of generality a=0 and b=1, the variance is bounded by 1/4 and this is realized by taking X to be a Bernoulli with p=\frac12. This bound says that the RV with the largest variance is the one where the mass is concentrated at the end points.

The bound in 2. is optimal only in the sense that it is the best quadratic bound on u(t). For example for a Bernoulli with a=0 and b=1 we have u(t)=\ln M(t)=\ln \left(\frac{1}{2}e^{t} + \frac{1}{2}\right)= \frac{1}{2}t + \ln \cosh\left(\frac{t}{2}\right) which is much smaller (for large t). There is room sometimes for better bounds (see e.g. (Boucheron, Lugosi, and Massart 2013)) but using a Gaussian bound is computationally convenient.

Hoeffding’s theorem

Theorem 4.2 (Hoeffding’s Theorem) Suppose X_1, \cdots, X_n are independent random variables such that a_i \le X_i \le b_i (almost surely). Then

\begin{align} P \left( X_1 + \cdots + X_n - E[X_1 + \cdots + X_n] \ge \varepsilon \right) &\le e^{- \frac{2 \varepsilon^2}{\sum_{i=1}^n(b_i-a_i)^2}}\\ P \left( X_1 + \cdots + X_n - E[X_1 + \cdots + X_n] \le -\varepsilon \right) &\le e^{- \frac{2 \varepsilon^2}{ \sum_{i=1}^n(b_i-a_i)^2}} \end{align} \tag{4.3}

Proof. Using independence and Hoeffding’s bound we have E\left[e^{t (X_1 + \cdots + X_n - E[X_1 + \cdots + X_n])}\right] =\prod_{i=1}^n E\left[e^{t(X_i-E[X_i])}\right] \le \prod_{i=1}^n e^{\frac{t^2 (b_i-a_i)^2}{8}} = e^{ \frac{t^2 \sum_{i}(b_i-a_i)^2}{8}} and using the Chernov bound (for a Gaussian RV with variance \displaystyle \sum_{i}\frac{(b_i-a_i)^2}{4}) gives the first bound in Equation 4.3.

Corollary 4.1 (non-asymptotic confidence interval) Suppose X_1, \cdots, X_N are independent random variables such that a \le X_i \le b (almost surely) and \mu=E[X_i].

P\left( \mu \in \left[ \frac{S_N}{N}-\varepsilon, \frac{S_N}{N} +\varepsilon \right] \right) \ge 1- 2 e^{-\frac{2 N \varepsilon^2}{(b-a)^2}}

We can rewrite this as

P\left( \mu \in \left[ \frac{S_N}{N}- \sqrt{ \frac{(b-a)^2\ln\left(\frac{2}{\delta}\right)}{2N}}, \frac{S_N}{N} +\sqrt{ \frac{(b-a)^2\ln\left(\frac{2}{\delta}\right)}{2 N}} \right] \right) \ge 1 -\delta

  • Comparison to the asymptotic confidence interval: take X_i to be Bernoulli with unknown \mu=p, so we bound the variance by p(1-p) \le \sigma_{max}^2=\frac{1}{4} (this is what we did in Hoeffding with a=0, b=1). So we have

(\textrm{CLT}) \quad z_{1-\delta} \frac{\sigma_{max}}{\sqrt{N}} \quad \textrm{ vs } \quad \sqrt{2\ln\left(\frac{2}{\delta}\right)} \frac{\sigma_{max}}{\sqrt{N}} \quad (\textrm{Hoeffding})
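For concreteness, a minimal numerical sketch of this comparison; the values of N and \delta are just illustrative.

# Half-widths of the CLT and Hoeffding confidence intervals for a Bernoulli mean
import numpy as np
N, delta, sigma_max = 10**4, 0.05, 0.5
clt = 1.96 * sigma_max / np.sqrt(N)                                 # z_{.95} = 1.960
hoeffding = np.sqrt(2 * np.log(2 / delta)) * sigma_max / np.sqrt(N)
(clt, hoeffding)    # Hoeffding is about 1.4 times wider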

McDiarmid theorem

A very useful extension of Hoeffding’s theorem applies to suitable functions h(X_1, X_2, \cdots, X_n) other than just the sum.

We say that h(x_1, x_2, \cdots, x_n) satisfies the bounded difference property if there exist constants c_k such that for all x_k, x_k' |h(x_1,\cdots, x_k, \cdots, x_n) - h(x_1,\cdots, x_k', \cdots, x_n)| \le c_k

Theorem 4.3 (McDiarmid Theorem) Suppose X_1, \cdots X_n are independent RVs and h(x_1, \cdots, x_n) satisfies the bounded difference property (almost surely). Then we have \begin{aligned} P\left( h(X_1,\cdots, X_n) - E[h(X_1, \cdots, X_n)] \ge \varepsilon \right) \le & e^{- \frac{\varepsilon^2}{2\sum_{k=1}^n c_k^2}} \\ P\left( h(X_1,\cdots, X_n) - E[h(X_1, \cdots, X_n)] \le -\varepsilon \right) \le & e^{- \frac{\varepsilon^2}{2\sum_{k=1}^n c_k^2}} \end{aligned}

Proof. The proof uses basic properties of conditional expectation.

“Martingale trick”: Let us define random variables Y_k by Y_0=E[h(X_1, \cdots, X_n)] and, for 1\le k \le n, Y_k =E[h(X_1,\cdots, X_n)|X_1, \cdots, X_k] Note that Y_n=h(X_1,\cdots, X_n) and also E[Y_k| X_1, \cdots, X_{k-1}]= Y_{k-1} (the martingale property). If we define further Z_k=Y_k-Y_{k-1} then we have the telescoping sum h(X_1,\cdots, X_n) - E[h(X_1, \cdots, X_n)] = \sum_{k=1}^n Z_k and E[Z_k|X_1,\cdots,X_{k-1}]=0.

Duplication trick: Let \widehat{X}_k be an independent copy of the random variable X_k. Then by the bounded difference property we have, almost surely,
\left| E[h(X_1, \cdots, X_k, \cdots, X_n)|X_1, \cdots, X_k] - E[h(X_1, \cdots, \widehat{X}_k, \cdots, X_n)|X_1, \cdots, X_k]\right| \le c_k \tag{4.4} Furthermore we have \begin{aligned} E[h(X_1, \cdots, X_k, \cdots, X_n)|X_1, \cdots, X_{k-1}]&=E[h(X_1, \cdots, \widehat{X}_k, \cdots, X_n)|X_1, \cdots, X_{k-1}]\\ &=E[h(X_1, \cdots, \widehat{X}_k, \cdots, X_n)|X_1, \cdots, X_k] \end{aligned} \tag{4.5} The first equality holds because X_k and \widehat{X}_{k} are identically distributed and the left-hand side is a function of X_1, \cdots, X_{k-1}. The second equality holds because X_k and \widehat{X}_{k} are independent. Combining Equation 4.4 and Equation 4.5 shows that |Z_k| \le c_k almost surely.

Conditioning and Hoeffding: We have, using conditioning and Hoeffding’s bound from Lemma 4.1, \begin{aligned} E\left[ e^{t\sum_{k=1}^n Z_k} \right] &= E \left[ E \left[ e^{t\sum_{k=1}^{n-1} Z_k} e^{tZ_n} | X_1, \cdots X_{n-1}\right] \right] = E \left[ e^{t\sum_{k=1}^{n-1} Z_k} E \left[ e^{tZ_n} | X_1, \cdots X_{n-1}\right] \right] \le e^{\frac{t^2 c_n^2}{2}} E \left[ e^{t \sum_{k=1}^{n-1} Z_k}\right] \end{aligned} where we used that E[Z_n|X_1, \cdots, X_{n-1}]=0 and |Z_n| \le c_n, so that Hoeffding’s bound gives E[e^{tZ_n}|X_1, \cdots, X_{n-1}] \le e^{\frac{t^2 c_n^2}{2}}. By induction we find

E\left[ e^{t (h(X_1, \cdots, X_n) -E[h(X_1, \cdots, X_n)])} \right] \le e^{\frac{t^2 \sum_{k=1}^n c_k^2}{2}} and the rest follows as usual.

Confidence interval for variance estimator

We show how to build a confidence interval for the unbiased variance estimator V_N(X_1, \cdots, X_N)=\frac{1}{N-1} \sum_{i=1}^N\left(X_i - \frac{S_N}{N}\right)^2 = \frac{1}{N-1}\sum_{i=1}^N X_i^2 - \frac{S_N^2}{(N-1)N} If we change X_1 into \widehat{X}_1 then S_N(X_1, \cdots, X_N ) - S_N(\widehat{X}_1, \cdots, X_N ) = X_1 - \widehat{X}_1 and so V_N(X_1, \cdots, X_N)- V_N(\widehat{X}_1, \cdots, X_N) = \frac{X_1^2 - \widehat{X}_1^2}{N-1} - \frac{X_1 - \widehat{X}_1}{N-1} \left(\frac{S_N(X_1, \cdots, X_N )}{N} + \frac{S_N(\widehat{X}_1, \cdots, X_N )}{N} \right) So if we assume |X_i|\le c we have the bound

|V_N(X_1, \cdots, X_N)- V_N(\widehat{X}_1, \cdots, X_N)| \le \frac{5c^2}{N-1} and thus by McDiarmid we get P( V_N - \sigma^2 \ge \varepsilon ) \, \le \, e^{ - \frac{(N-1)^2}{N}\frac{\varepsilon^2}{50c^4}} \,\le\, e^{ -(N-2)\frac{\varepsilon^2}{50c^4}} and this decays exponentially in N, so it can again be used to build a confidence interval for the variance, very similar to the one for the mean.

Bernstein bound

In Hoeffding’s bound we use, in an essential way, the worst-case bound on the variance. If the actual variance is much smaller, one should expect Hoeffding’s bound to be poor (too pessimistic). Bernstein’s bound, which can be used if we have some a-priori knowledge about the variance, improves on it.

Theorem 4.4 (Bernstein Bound) Suppose X is a random variable such that |X-E[X]| \le c and \textrm{var}(X) \le \sigma^2. Then E[e^{t X}]\le e^{t E[X]+ \frac{\sigma^2 t^2}{2(1 -c|t|/3)}} \,.

Proof. We expand the exponential and use that for k\ge 2, E\left[(X-\mu)^k\right] \le E\left[(X-\mu)^2|X-\mu|^{k-2}\right]\le E[(X-\mu)^2] c^{k-2} \le \sigma^2 c^{k-2} and get \begin{aligned} E\left[e^{t(X-\mu)}\right]=1+ \sum_{k=2}^\infty \frac{t^k}{k!} E[(X-\mu)^k] &\le 1 + \frac{t^2\sigma^2}{2}\sum_{k=2}^\infty\frac{2}{k!}(|t|c)^{k-2} \\ &\le 1 + \frac{t^2\sigma^2}{2}\sum_{k=2}^\infty \left(\frac{|t|c}{3}\right)^{k-2} \quad \textrm{ since } \frac{k!}{2}\ge 3^{k-2}\\ &\le 1 + \frac{t^2\sigma^2}{2(1-\frac{|t|c}{3})} \le e^{\frac{t^2\sigma^2}{2(1-\frac{|t|c}{3})}} \quad \textrm{ since } 1+ x \le e^x \end{aligned} and this concludes the proof.

Bernstein confidence interval

To obtain a confidence interval we need to solve an optimization problem: after some straightforward computation we find

\sup_{t \ge 0} \left\{ \varepsilon t - \frac{a t^2}{2(1-bt)} \right\} = \frac{a}{b^2}h\left( \frac{b\varepsilon}{a}\right) \quad \textrm{ where } h(u) = 1 +u - \sqrt{1 + 2u} and we note that h^{-1}(z)= z + \sqrt{2z}. We obtain the exact same bound for the left tail (by symmetry) and thus
to obtain a confidence interval we have \delta = 2 e^{-\frac{a}{b^2}h\left( \frac{b\varepsilon}{a}\right)} \iff \varepsilon = b \ln\left( \frac{2}{\delta}\right) + \sqrt{2a \ln\left( \frac{2}{\delta}\right)}

Theorem 4.5 (Bernstein for sum of IID) If X_1, \cdots, X_N are IID random variables with |X_i| \le c and \textrm{Var}(X_i)\le \sigma^2 then

P\left( \mu \in \left[\frac{S_N}{N} - \frac{c}{3N} \ln\left( \frac{2}{\delta}\right) - \sqrt{\frac{2\sigma^2}{N} \ln\left( \frac{2}{\delta}\right)}, \frac{S_N}{N} + \frac{c}{3N} \ln\left( \frac{2}{\delta}\right) + \sqrt{\frac{2\sigma^2}{N} \ln\left( \frac{2}{\delta}\right)} \right] \right) \ge 1 - \delta

Proof. This is Bernstein bound with a=N \sigma^2 and b=c/3 and \varepsilon \to N \varepsilon.

(\textrm{Bernstein}) \quad \frac{c}{3N} \ln\left( \frac{2}{\delta}\right) + \sqrt{2 \ln\left( \frac{2}{\delta}\right)} \frac{\sigma}{\sqrt{N}} \quad \textrm{ versus} \quad \sqrt{2\ln\left(\frac{2}{\delta}\right)} \frac{\sigma_{max}}{\sqrt{N}} \quad (\textrm{Hoeffding})

You should note that this bound can be substantially better than Hoeffding’s bound if \sigma \ll \sigma_{max}.

Comparing the bounds for Bernoulli

Hoeffding’s bound amounts to bounding the variance by \frac14. Bernstein’s bound, which incorporates knowledge of (or a bound on) the variance, is much better than Hoeffding’s as soon as the variance is not too big (e.g. \sigma^2=0.1).

5 Markov Chains Basics

We introduce the basic definitions necessary to describe Markov chains and provide a first series of examples. For further reading we recommend the books Lawler (2006) and Levin et al. (2017).

Markov chain on a discrete state space

  • A stochastic process is a sequence of random variables \{X_0,X_1, X_2,\cdots \} where the X_n take values in a space S, called the state space, which we assume for the moment to be discrete (either finite or countable): S=\{i_1, i_2, \cdots\}. Think of the index of X_n as (discrete) “time” and X_n as describing the state of the system at time n.
  • To describe a general stochastic process we need the joint pdf of X_0, \cdots, X_n (for any n) (the finite dimensional distributions) P \left( X_0= i_0, X_1=i_1, \cdots, X_n = i_n \right) = P \left( X_0=i_0\right) P\left( X_1=i_1 \vert X_0=i_0 \right) \cdots P\left( X_n=i_n \vert X_{n-1}=i_{n-1}, \cdots, X_0=i_0 \right) which we rewrote in terms of conditional probabilities.
  • The Markov property is an assumption on the structure of these conditional probabilities, namely that the future depends only on the present and not on the past, that is
    P\left( X_n=i_n \vert X_{n-1}=i_{n-1}, \cdots, X_0=i_0 \right)= P(X_n=i_n|X_{n-1}=i_{n-1}) We think here of n-1 as the present and n as the future, and interpret P(X_n=j|X_{n-1}=i) as the probability to be in state j at time n given that the state at time n-1 was i.
    Therefore for a Markov process we can rewrite the joint pdf as P \left( X_0= i_0, X_1=i_1, \cdots, X_n = i_n \right) \,=\, P \left( X_0=i_0\right) P\left( X_1=i_1 \vert X_0=i_0 \right) \cdots P\left( X_n=i_n \vert X_{n-1}=i_{n-1} \right)
  • We say that a Markov process is time-homogeneous if P(X_n=j|X_{n-1}=i), the probability to move from i to j, does not depend on the time n.

  • A time homogeneous Markov chain is specified by

    • Initial distribution \mu(i) for i \in S where \mu(i)=P(X_0=i). We have \mu(i)\ge 0 \quad \textrm{ and } \sum_{i \in S} \mu(i) =1 and \mu(i) is called a probability vector .

    • Transition matrix P(i,j) for i,j \in S where P(i,j)=P(X_n=j|X_{n-1}=i). We have P(i,j) \ge 0 \quad \textrm{ and } \sum_{j \in S} P(i,j)=1 and the P(i,j) are called the transition probabilities.

  • All quantities of interest for the Markov chain can be computed using these two objects. For example we have P\left\{ X_0 =i_0, X_1=i_1, X_2=i_2\right\} \,=\, \mu(i_0) P(i_0,i_1) P(i_1, i_2) \,. or P\left\{ X_2=i \right\} \,=\, \sum_{i_0\in S} \sum_{i_1 \in S} \mu(i_0) P(i_0,i_1) P(i_1, i) and so on.

  • Without loss of generality, we can relabel the states so that S=\{ 1, 2, 3, \cdots, N \} (with N\le\infty). It will be convenient to set

\mu = ( \mu(1), \mu(2), \cdots, \mu(N)) that is \mu is a row vector whose entries are the initial distribution.

Also we will write P for the N \times N matrix whose entries are P(i,j)

P \,=\, \left( \begin{array}{cccc} P(1,1) & P(1,2) & \cdots & P(1,N) \\ P(2,1) & P(2,2) & \cdots & P(2,N) \\ \vdots & \vdots & & \vdots \\ P(N,1) & P(N,2) & \cdots & P(N,N) \end{array} \right)
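
As a small illustration of the fact that \mu and P fully specify the chain, here is a minimal simulation sketch (the helper name sample_chain and the 2-state example are ours, not from the notes).

# Sampling a trajectory X_0, ..., X_n from an initial distribution mu and a transition matrix P
import numpy as np

def sample_chain(mu, P, n, seed=None):
    rng = np.random.default_rng(seed)
    states = np.arange(len(mu))
    path = [rng.choice(states, p=mu)]                    # draw X_0 from mu
    for _ in range(n):
        path.append(rng.choice(states, p=P[path[-1]]))   # draw X_{k+1} from the row P(X_k, .)
    return np.array(path)

mu = np.array([1.0, 0.0])                                # start in state 0
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
print(sample_chain(mu, P, 10, seed=0))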

Kolmogorov equations

Proposition 5.1 (Kolmogorov equations) For a Markov process with initial distribution \mu and transition probabilities P:

  1. The n-step transition probabilities are given by P \left\{ X_{n}=j \vert X_0={i} \right\} \,=\, P^n(i,j) where P^n is the matrix product \underbrace{P\cdots P}_{\rm n~times}.

  2. If \mu(i)=P\{X_0=i\} is the distribution of X_0 then P\{ X_n=i\} \,=\, \mu P^n (i) is the distribution of X_n.

  3. If f = (f(1), \cdots, f(N))^T is a column vector (think of f as a function f: S \to \mathbb{R}) then we have
    P^n f(i) \,=\, E\left[ f(X_n) | X_0=i\right].

Proof. For 1. we use induction and assume the formula is true for n-1. We condition on the state at time n-1, use the formula P( AB\vert C) = P(A \vert BC) P(B\vert C) and the Markov property, to find

\begin{aligned} P \left( X_{n}=j \vert X_0={i}\right) &= \sum_{k \in S} P \left( X_{n}=j \,, X_{n-1}=k \vert X_0={i} \right) \\ &= \sum_{k \in S} P \left( X_{n}=j \vert X_{n-1}=k \,, X_0={i} \right) P \left( X_{n-1}=k \vert X_0={i} \right) \\ &= \sum_{k \in S} P \left\{ X_{n}=j \vert X_{n-1}=k \right\} P \left\{ X_{n-1}=k \vert X_0={i} \right\} \\ &= \sum_{k \in S} P^{n-1}(i, k) P(k,j) \,=\, P^n(i,j) \,. \end{aligned}

For 2. note that \mu P is a probability vector since \sum_{i} \mu P(i) \,=\, \sum_{i} \sum_{j} \mu(j ) P(j,i) \,=\, \sum_{j} \mu(j) \sum_i P(j,i) \,=\, \sum_j \mu(j) \,.

Furthermore by the formula for conditional probabilities part 1. P\{ X_n=j\} \,=\, \sum_{k \in S} P\{ X_n=j \vert X_0 =k\} P\{ X_0=k\} \,=\, \sum_{k} \mu(k) P^n(k,j) \,=\, \mu P^n (j)\,.

For 3. we have

P^nf (i) \,=\, \sum_{k} P^n(i,k) f(k) \,=\, \sum_{k} f(k) P \left\{ X_{n}=k \vert X_0={i} \right\} \,=\, E[f(X_{n})\,\vert \, X_0=i] \,.
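
All three statements can be checked numerically with matrix powers; below is a brief sketch (the 3-state chain is an arbitrary choice of ours).

# Numerical check of the Kolmogorov equations on a small chain
import numpy as np
from numpy.linalg import matrix_power

P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
mu = np.array([1.0, 0.0, 0.0])      # initial distribution (row vector)
f = np.array([1.0, 2.0, 3.0])       # a function on the state space (column vector)

Pn = matrix_power(P, 5)
print(Pn)                           # n-step transition probabilities P^n(i,j)
print(mu @ Pn)                      # distribution of X_n:  mu P^n
print(Pn @ f)                       # E[f(X_n) | X_0 = i]:  P^n f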

Stationary and limiting distributions

  • Basic question: For a Markov chain, understand the distribution of \{X_n\} for large n, for example we may want to know whether the limit \lim_{n\to \infty} P\{ X_n=i\} \,=\, \lim_{n \to \infty} \mu P^n(i) exists or not, whether it depends on the choice of initial distribution \mu and how to compute it.
  • A probability vector \pi is called a limiting distribution if the limit \lim_{n \to \infty} \mu P^n = \pi exists. It could well be that the limit depends on the choice of the initial distribution \mu.
  • A probability vector \pi is called a stationary distribution if \pi P = \pi.
  • Limiting distributions are always stationary distributions: Suppose \pi is a limiting distribution and \lim_{n\to \infty} \mu P^n = \pi. Then \pi P \,=\, (\lim_{n \to \infty} \mu P^n ) P \,=\, \lim_{n \to \infty} \mu P^{n+1} \,=\, \lim_{n \to \infty} \mu P^{n} \,=\, \pi \,. and thus \pi is stationary.

  • In the next section we will derive conditions under which stationary distributions are unique and are limiting distributions.

Examples

Example 5.1 (2-state Markov chain) Suppose S=\{1,2\} with transition matrix \displaystyle P \,=\, \left( \begin{array}{cc} 1-p & p \\ q & 1-q \end{array} \right) \,.

The equation for the stationary distribution, \pi P = \pi gives \pi(1) (1-p) + \pi(2) q \,=\, \pi(1) \quad \quad \pi(1) p + \pi(2) (1-q) \,=\, \pi(2) that is p \pi(1) = q \pi(2). Normalizing to a probability vector gives \displaystyle \pi = \left( \frac{q}{p+q},\frac{p}{p+q} \right).

We show \pi is a limiting distribution. Set \mu_n \equiv \mu P^n and consider the difference \mu_n -\pi. Using \mu_n(2)=1-\mu_n(1) we get the equation \begin{aligned} \mu_n(1) - \pi(1) &= \mu_{n-1} P(1) - \pi(1) = \mu_{n-1}(1) (1-p) + (1- \mu_{n-1}(1)) q - \frac{q}{p+q} \\ & = \mu_{n-1} (1)( 1-p-q) - \frac{q}{p+q}(1 -p -q) \,=\, (1-p-q) ( \mu_{n-1}(1) - \pi(1) ) \end{aligned} By induction \mu_n(1) - \pi(1) = (1-p-q)^n (\mu_0(1) - \pi(1)). If 0 < p+q < 2 (so that |1-p-q|<1) then \lim_{n \to \infty} \mu_n(1) = \pi(1) and this implies that \lim_{n\to \infty} \mu_n(2) = \pi(2) as well.

If 0 < p+q < 2 then \mu_n = \mu P^n converges to the stationary distribution \pi for an arbitrary choice of the initial distribution \mu.
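
A quick numerical check of this geometric convergence (the particular values of p, q and \mu below are arbitrary choices of ours):

# Convergence of mu P^n to pi = (q/(p+q), p/(p+q)) for the 2-state chain
import numpy as np

p, q = 0.2, 0.5
P = np.array([[1 - p, p], [q, 1 - q]])
pi = np.array([q / (p + q), p / (p + q)])

mu = np.array([1.0, 0.0])
for n in range(10):
    print(n, mu, abs(mu[0] - pi[0]))     # the error shrinks by a factor |1-p-q| each step
    mu = mu @ P                          # mu_{n+1} = mu_n P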

Example 5.2 (Coupon collecting) A company offers toys in breakfast cereal boxes. There are N different toys available and each toy is equally likely to be found in any cereal box.

Let X_n be the number of distinct toys that you collect after buying n boxes and it is natural to set X_0=0. Then X_n is a Markov chain with state space \{0,1,2,\cdots,N\} and it has a simple structure since X_{n} either stays the same or increases by 1.

The transition probabilities are \begin{aligned} P(j, j+1) \,&=\, P\{ {\rm ~new~toy~} \vert {\rm ~already~j~toys}\} \,=\, \frac{N-j}{N} \\ P(j, j) \,& =\, P\{ {\rm ~no~new~toy~} \vert {\rm ~already~j~toys}\} \,=\, \frac{j}{N} \end{aligned} The Markov chain X_n will eventually reach the state N and stay there forever (N is called an absorbing state). Let us denote by \tau the (random) finite time it takes to reach the state N. To compute its expectation, E[\tau], let us write \tau = T_1 + \cdots +T_N \,, where T_i is the time needed to get a new toy after you have gotten your (i-1)^{th} toy. The T_i’s are independent and T_i has a geometric distribution with success probability p_i = (N-i+1)/N. Thus E[ \tau] \,=\, \sum_{i=1}^N E[ T_i] \,=\, \sum_{i=1}^N \frac{N}{N-i+1} \,=\, N \sum_{i=1}^N \frac{1}{i} \approx N \ln (N) \,.
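
A small Monte Carlo sketch of the collection time, to compare with N \sum_{i=1}^N \frac{1}{i} \approx N\ln N (the helper name collect_time is ours):

# Simulating the coupon collector time tau and comparing with N * (1 + 1/2 + ... + 1/N)
import numpy as np

def collect_time(N, rng):
    seen, n = set(), 0
    while len(seen) < N:            # keep buying boxes until all N toys have been seen
        seen.add(rng.integers(N))   # each box contains a uniformly chosen toy
        n += 1
    return n

rng = np.random.default_rng(0)
N = 50
samples = [collect_time(N, rng) for _ in range(2000)]
print(np.mean(samples), N * np.sum(1.0 / np.arange(1, N + 1)))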

Example 5.3 (Random walk on a graph) Consider an undirected graph G with vertex set V and edge set E (each edge e=\{v,w\} is an (unordered) pair of vertices). We say that the vertex v is a neighbor of the vertex w, and write v \sim w, if \{v,w\} is an edge. The degree of a vertex v, denoted \textrm{deg}(v), is the number of neighbors of v.

Given such a graph G=(V,E) the simple random walk on G is the Markov chain with state space V and transition matrix P(v,w) \,=\, \left\{ \begin{array}{cl} \frac{1}{{\rm deg} (v)} & {\rm if~} w \sim v \\ 0 & {\rm otherwise} \end{array} \right. \,.

The invariant distribution for the random walk on a graph is given by \pi(v) \,=\, \frac{ {\rm deg} (v)} { 2 |E|} where |E| is the number of edges. First note that \sum_{v} \pi(v)=1 since \sum_v {\rm deg}(v) = 2|E| (each edge connects two vertices). To show invariance note that \pi P(v) \,=\, \sum_{w} \pi(w)P(w,v) \,=\, \sum_{w ; w \sim v} \frac{{\rm deg} (w)}{2|E|} \frac{1}{ {\rm deg}(w)} \,=\, \frac{1}{2|E|} \sum_{w ; w \sim v}1 \,=\, \pi(v) \,.

For example, for a graph on five vertices with 8 edges in which vertex 2 has degree 4 and all other vertices have degree 3, P\,=\, \left( \begin{array}{ccccc} 0 & \frac13 & \frac13 & \frac 13 & 0 \\ \frac14 & 0 & \frac14 & \frac 14 & \frac14 \\ \frac13 & \frac13 & 0 & 0& \frac 13 \\ \frac13 & \frac13 & 0 & 0 & \frac13 \\ 0 & \frac13 & \frac13 & \frac13 & 0 \\ \end{array} \right) \quad \quad \quad \pi = \left( \frac{3}{16}, \frac{4}{16}, \frac{3}{16}, \frac{3}{16}, \frac{3}{16}\right)
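
For this example the invariance can be checked directly from the degrees; the short sketch below rebuilds P from the adjacency structure of the five-vertex graph above.

# Random walk on a graph: pi(v) = deg(v) / (2|E|) is stationary
import numpy as np

A = np.array([[0, 1, 1, 1, 0],      # adjacency matrix of the 5-vertex example above
              [1, 0, 1, 1, 1],
              [1, 1, 0, 0, 1],
              [1, 1, 0, 0, 1],
              [0, 1, 1, 1, 0]])
deg = A.sum(axis=1)
P = A / deg[:, None]                # P(v,w) = 1/deg(v) if w ~ v, 0 otherwise
pi = deg / deg.sum()                # deg(v) / (2|E|)
print(pi)                           # (3/16, 4/16, 3/16, 3/16, 3/16)
print(np.allclose(pi @ P, pi))      # check pi P = pi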

Example 5.4 (Random walk on the hypercube) The d-dimensional hypercube graph Q_d has for vertices the binary d-tuples \mathbf{x}=(x_1, \cdots, x_d) with x_k \in \{0,1\} and two vertices are connected by an edge when they differ in exactly one coordinate (flipping a 0 into 1 or vice-versa).

The simple random walk on Q_d moves from one vertex x=(x_1, \cdots, x_d) by choosing a coordinate j \in \{ 1,2,\cdots, d\} uniformly at random and setting x_j \to (1-x_j).

The degree of each vertex is d, the number of vertices is 2^d and the number of edges is 2^d \frac{d}{2}, so \pi(\mathbf{x}) \,=\, \frac{1}{2^d}, the uniform distribution on Q_d.

Example 5.5 (Asymmetric random walk on \{0,1,\cdots,N\}) The state space is S=\{0,1,\cdots,N\} and the Markov chain goes up by 1 with probability p and down by 1 with probability 1-p. P(j, j+1) =p,\, P(j, j-1) = 1-p \,, \quad \textrm {for } j=1, \cdots, N-1 We can pick different boundary conditions (BC) at 0 and N:

  • Absorbing BC: P(0,0)=1, P(N,N)=1

  • Reflecting BC: P(0,1)=1, P(N,N-1)=1

  • Partially reflecting BC: P(0,0)=(1-p)\,, P(0,1)=p\,, \quad P(N,N-1)=(1-p)\,, P(N,N)=p

  • Periodic BC: P(0,1)=p\,, P(0, N)=(1-p)\,, \quad P(N, 0) =p\,, P(N, N-1)= (1-p)

Example 5.6 (Ehrenfest urn model) Suppose d balls are distributed among two urns, urn A and urn B. At each move one ball is selected uniformly at random among the d balls and is transferred from its current urn to the other urn. If X_n is the number of balls in urn A then the state space is S=\{0, 1, \cdots, d\} and the transition probabilities P(j,j+1) \,=\, \frac{d-j}{d}\,, \quad P(j,j-1) \,=\, \frac{j}{d} \,. We show that the invariant distribution is binomial with parameters (d,\frac{1}{2}), that is \pi(j) \,=\, {d \choose j} \frac{1}{2^d}.

\begin{aligned} \pi P (j) &= \sum_{k} \pi(k) P(k,j) = \pi(j-1) P(j-1,j) + \pi(j+1) P(j+1, j) \\ &= \frac{1}{2^d}\left[ {d \choose j-1} \frac{d - (j-1)}{d} + {d \choose j+1} \frac{j+1}{d} \right] = {d \choose j} \frac{1}{2^d} \,. \end{aligned}

This Markov chain is closely related to the simple random walk Y_n on the hypercube Q_d. Indeed selecting randomly one of the d balls and moving it to the other urn is equivalent to selecting a random coordinate y_k of \mathbf{y}=(y_1, \cdots, y_d) and changing it to 1-y_k. If we denote by j= |\mathbf{y}|=y_1+ \cdots + y_d the number of 1s in \mathbf{y}, then P(X_{n+1}=j+1|X_n=j) = P(\textrm{ choose } k \textrm{ such that } y_k=0) = \frac{d-j}{d}

6 Ergodic theory of finite state space Markov chains

We present the basic convergence theory for finite state Markov chains and introduce the concepts of irreducibility and period of a Markov chain. This is very classical stuff, see e.g. Lawler (2006) and Levin et al. (2017).

Existence of stationary distributions

We first show that stationary distributions always exist for finite state Markov chains. This need not be the case when the state space is countably infinite.

Theorem 6.1 Let X_n be a Markov chain on a finite state space S. Then X_n has at least one stationary distribution.

Proof. The Bolzano-Weierstrass theorem asserts that any bounded sequence \{\mathbf{x}_k\} in \mathbb{R}^d has a convergent subsequence.

Choose \mu to be an arbitrary initial distribution and let \mu_n = \mu P^n. Applying Bolzano-Weierstrass to the sequence \mu_n directly will not work, so consider instead the time averages \nu_n \,=\, \frac{\mu + \mu P + \cdots + \mu P^{n-1}}{n} The sequence \nu_n is bounded since \mu_n and \nu_n are probability vectors. Therefore there exists a convergent subsequence \nu_{n_k} with \lim_{k\to\infty}\nu_{n_k}=\pi and \pi is a probability vector as well. We show that \pi is a stationary distribution. Note \nu_n P - \nu_n \,=\, \frac{ \mu P + \cdots + \mu P^{n} - \mu - \cdots -\mu P^{n-1}}{n} \,=\, \frac{\mu P^n - \mu}{n} \quad \textrm{ and thus } \quad \left\vert \nu_n P(j) - \nu_n(j) \right\vert \,\le \, \frac{1}{n} To conclude we note that | \pi P(j) - \pi(j)| \,=\, \lim_{k \to \infty} | \nu_{n_k}P (j) - \nu_{n_k}(j) | \,\le\, \lim_{k \to \infty} \frac{1}{n_k} \,=\, 0 \,. and thus \pi P = \pi, that is \pi is a stationary distribution.

Uniqueness of stationary distributions

  • It is easy to build examples of Markov chains with multiple stationary distributions.

P \,=\, \left( \begin{array}{cccc} \frac{1}{2} & \frac{1}{2} & 0 & 0 \\ \frac13 & \frac23 & 0 & 0 \\ 0 & 0 & \frac16 & \frac56 \\ 0 & 0 & \frac45 & \frac15 \end{array} \right) \quad \quad \pi_1=\left( \frac25 , \frac 35 ,0,0\right) \textrm{ and } \pi_2= \left( 0, 0 ,\frac{24}{49}, \frac{25}{49}\right) \quad \textrm{ are stationary} If the Markov chain starts in state 1 or 2 it will only visit states 1 and 2 in the future and so we can build a stationary distribution (by Theorem 6.1 using an initial \mu_0 which vanishes on 3 and 4) which will vanish on states 3 and 4.

  • Notations:

    • We say that j is accessible from i, symbolically i \rightsquigarrow j if there exists n\ge 0 such that P^n(i,j)>0.

    • We say that i and j communicate , symbolically i \leftrightsquigarrow j if i \rightsquigarrow j and j \rightsquigarrow i.

    • A Markov chain X_n is irreducible if every state i\in S communicates with every other state j \in S, that is, for any pair of states i,j in S there exists n such that P^n(i,j) >0.

    In an irreducible Markov chain starting from any state you will eventually visit any other state.

Theorem 6.2 (Uniqueness of stationary distributions) Suppose X_n is an irreducible Markov chain with finite state space S.

  1. If \pi is a stationary distribution then \pi(j)>0 for all j\in S,

  2. If h is a column vector such that Ph=h, then h=c(1,1,\cdots,1)^T is a constant vector.

  3. The Markov chain X_n has a unique stationary distribution \pi.

Proof. For 1. if \pi is a stationary distribution then \pi(i)>0 for at least one i. If j is such that i \rightsquigarrow j then there exists a time r such that P^r(i,j) >0 and thus \pi(j) =\pi P^r(j)\,=\, \sum_{k} \pi(k) P^r(k,j) \,\ge \, \pi(i) P^r(i,j) >0 \,. Since X_n is irreducible, \pi(j)>0 for any j \in S.

For 2, suppose that Ph=h and let i_0 be such that h(i_0)= \max_{i\in S} h(i)\equiv M. If h is not a constant vector there exists j with i_0 \rightsquigarrow j but h(j) < M and P^r(i_0,j)>0 for some r. Since P^rh=h, M= h(i_0) \,=\, P^r h(i_0) \,=\, P^r( i_0, j) \underbrace{h(j)}_{< M} + \sum_{l \not= j} P^r(i_0,l) \underbrace{h(l)}_{\le M} < M\sum_{l} P^r(i_0,l)=M\,, and this is a contradiction.

For 3. part 2. shows that the geometric multiplicity of the eigenvalue 1 of the matrix P is equal to 1, and thus so is the geometric multiplicity of the eigenvalue 1 for the adjoint P^T. Therefore, up to normalization, P has at most one left eigenvector with eigenvalue 1, that is at most one stationary distribution \pi.

Convergence to stationary distribution

  • Question: If we have a unique stationary distribution is it correct that \lim_{n \to \infty} \mu P^n = \pi?

  • Without further assumption the answer is NO, because of possible periodic behavior. Consider for example the random walk on \{1, \cdots, 2N\} with periodic boundary conditions. The stationary distribution is uniform \pi= \left(\frac{1}{2N}, \cdots, \frac{1}{2N}\right). But if X_0 is even then X_1 is odd, and the chain alternates periodically between odd and even positions. For example for

P \,=\, \left( \begin{array}{cccc} 0 & 1/4 & 0 & 3/4 \\ 3/4 & 0 & 1/4 & 0 \\ 0 & 3/4 & 0 & 1/4 \\ 1/4 & 0 & 3/4 & 0 \end{array} \right) \quad \quad \pi = \left( \frac14,\frac14 ,\frac14,\frac14 \right) \textrm{ is stationary} and

P^{10}=\begin{pmatrix} 0.4995 & 0 & 0.5004 & 0 \\ 0 & 0.4995 & 0 & 0.5004 \\ 0.5004 & 0 & 0.4995 & 0 \\ 0 & 0.5004 & 0 & 0.4995 \\ \end{pmatrix} \quad \quad P^{11}= \begin{pmatrix} 0 & 0.5002 & 0 & 0.4997 \\ 0.4997 & 0 & 0.5002 & 0 \\ 0 & 0.4997 & 0 & 0.5002 \\ 0.5002 & 0 & 0.4997 & 0 \\ \end{pmatrix} and P^n(i,j) does not converge, oscillates, asymptotically between 0, and 1/2.

  • A Markov chain X_n is called irreducible and aperiodic if there exists an integer n such that P^n(i,j) > 0 for all pair i,j in S. (The meaning of the terminology will become clearer later on.)

Theorem 6.3 (Doeblin’s Theorem) Let X_n be an irreducible and aperiodic Markov chain on the finite state space S with stationary distribution \pi. There exists a constant C>0 and a number \alpha with 0\le \alpha< 1 such that for any initial distribution \mu we have |\mu P^n(j) - \pi(j)| \,\le\, C \alpha^n \,, \tag{6.1} i.e., the distribution of X_n converges, exponentially fast, to \pi.

Proof. Since the Markov chain is irreducible and aperiodic we can find an integer r such that P^r has strictly positive entries. Let \Pi be the stochastic matrix \Pi \,=\, \left( \begin{array}{cccc} \pi(1)& \pi(2)& \cdots& \pi(N) \\ \pi(1)& \pi(2)& \cdots& \pi(N) \\ \vdots& \vdots & &\vdots \\ \pi(1) &\pi(2)& \cdots& \pi(N) \end{array} \right) where every row is the stationary distribution \pi. Note that this corresponds to independent sampling from the stationary distribution.

Since all elements of P^r(i,j) are strictly positive we can pick \delta >0 sufficiently small such that for all i,j P^r(i,j) \,\ge \, \delta\,\Pi(i,j)=\delta \pi(j) \,. \tag{6.2}

Let us set \theta = 1-\delta and by Equation 6.2 we define a stochastic matrix Q through the equation P^r = (1-\theta) \Pi + \theta Q \,. Note the following facts:

  • Since \pi P = \pi we have \Pi P^n = \Pi for any n \ge 1.

  • For any stochastic matrix M we have M \Pi = \Pi since all rows of \Pi are identical.

Using these two properties we show, by induction, that any integer k\ge 1, P^{kr} \,=\, (1 - \theta^k) \Pi + \theta^k Q^k. This is true for k=1 and so let us assume this to be true for k. We have \begin{aligned} P^{r(k+1)} = P^{rk} P^r &= \left[ (1 - \theta^k) \Pi + \theta^k Q^k \right] P^r = (1 - \theta^k) \Pi P^r + \theta^k Q^k [(1-\theta) \Pi + \theta Q] \\ &= (1 - \theta^k) \Pi + \theta^k(1-\theta) \Pi + \theta^{k+1} Q^{k+1} = (1 - \theta^{k+1}) \Pi + \theta^{k+1} Q^{k+1} \end{aligned} and this concludes the induction step.

From this we conclude that P^{rk} \to \Pi as k \to \infty. We write any integer n as n=k r +l where 0 \le l < r. We have then P^n = P^{kr} P^l = [(1 - \theta^k) \Pi + \theta^k Q^k] P^l = \Pi + \theta^{k} \left[ Q^k P^l - \Pi \right] and thus |P^n(i,j) - \pi(j)| \,=\, \theta^k |Q^k P^l(i,j) - \Pi(i,j)| \le \theta^k \le \frac{1}{\theta} (\theta^{1/r})^n \equiv C\alpha^n \,. The same holds for any initial distribution multiplying by \mu(i) and summing.
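
The geometric convergence of Theorem 6.3 is easy to observe numerically; here is a brief sketch (the 3-state chain is an arbitrary choice of ours, with \pi computed as the left eigenvector of P for the eigenvalue 1).

# Observing the exponential convergence of P^n towards the rank-one matrix Pi
import numpy as np

P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])

vals, vecs = np.linalg.eig(P.T)                  # left eigenvectors of P
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = pi / pi.sum()                               # normalize to a probability vector

Pn = np.eye(3)
for n in range(1, 11):
    Pn = Pn @ P
    print(n, np.abs(Pn - pi).max())              # sup_{i,j} |P^n(i,j) - pi(j)| decays like alpha^n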

The period of a Markov chain

As we have seen before, Markov chains can exhibit periodic behavior and cycle around various parts of the state space.

  • The period of state j, \textrm{per}(j), is the greatest common divisor of the set {\mathcal T}(j) \,=\, \left\{ n\ge 1\,,\, P^n(j,j)>0 \right\} that is {\mathcal T}(j) is set of all times at which the chain can return to j with positive probability.

For example for random walks with periodic boundary conditions the period is 2 if the number of sites is even and 1 if the number of sites is odd.

  • The set {\mathcal T}(j) is closed under addition: n,m \in {\mathcal T}(j) \implies m+n \in {\mathcal T}(j) which holds true since P^{n+m}(j,j) \ge P^{n}(j,j)P^{m}(j,j).

  • By definition if \textrm{per}(j)=d then {\mathcal T}(j) \subset \{ 0,d,2d, \cdots\} Since it is closed under addition a fact from number theory (see exercise) implies that there exists k such that \{ kd, (k+1)d, (k+2)d \cdots \} \subset {\mathcal T}(j) \quad \textrm { or equivalently } \quad P^{nd}(j,j)>0 \textrm{ for all } n\ge k

Lemma 6.1 If i \leftrightsquigarrow j then \textrm{per}(i)=\textrm{per}(j).

Proof. Suppose \textrm{per}(j)=d and i \leftrightsquigarrow j. There exist integers m and n such that P^m(i,j)>0 and P^n(j,i)>0 and thus P^{m+n}(i,i)>0 \textrm{ and }P^{m+n}(j,j)>0 which means that m+n \in {\mathcal T}(i) \cap {\mathcal T}(j) and so m+n=kd for some integer k.

Suppose l \in {\mathcal T}(i) then P^{n+m+l}(j,j) \ge P^n(j,i) P^l(i,i) P^m(i,j) > 0 and n+m+l \in {\mathcal T}(j). Therefore d divides l since it divides both n+m+l and n+m and this shows that \textrm{per}(j) \le \textrm{per}(i). Reversing the roles of i and j we find \textrm{per}(j) = \textrm{per}(i). \quad \blacksquare

Consequences:

  • If X_n is irreducible then all states have the same period and we can define the period of the irreducible Markov chain X_n .

  • If X_n is irreducible and has period 1 (also called aperiodic) and S is finite then P^{n}(i,j) >0 for all sufficiently large n (compare with our earlier definition of irreducible and aperiodic).

Decomposing the state space S = G_1 \cup G_2 \cup \cdots G_d:

  • Start with a state i\in S and let i \in G_1.

  • For all j such that P(i,j)>0 let j \in G_2.

  • For all j\in G_2 and all k such that P(j,k)>0 let k\in G_3.

  • After assigning states to G_d, assign the next states to G_1.

  • Repeat until all states have been assigned. If |S| is finite this takes at most |S| steps.

By construction, since the period is d, we get a partition of S into d disjoint subsets G_1, \cdots, G_d with P(i,j)> 0 \implies i \in G_k \textrm{ and } j \in G_{k+1 (\textrm{mod } d)} For example

P= \begin{matrix} 1\\2\\3\\4\\5\\6\\7 \end{matrix} \begin{pmatrix} 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & .2 & 0 & 0 & .8 & 0 \\ .3 & .7 & 0 & 0 & 0 & 0 & 0 \\ .2 & .8 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & .7 & .3 & 0 \\ \end{pmatrix}

Iteratively we find 1 \rightsquigarrow 4 \rightsquigarrow 3,6 \rightsquigarrow 1,2 \rightsquigarrow 4,7 \rightsquigarrow 3,5,6 \rightsquigarrow 1,2 \rightsquigarrow 4,7 and thus the Markov chain is irreducible with period 3 and so we have
S=\{1,2\} \cup \{4,7\} \cup \{3,5,6\}

Relabeling the state the transition matrix P takes the following block matrix form P\,=\, \begin{matrix} G_1 \\ G_2 \\ \vdots\\ G_{d-1}\\ G_d \end{matrix} \begin{pmatrix} 0 & P_{G_1 G_2} & 0 & \cdots &0 \\ 0 & 0 &P_{G_2 G_3} & \cdots &0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & 0 &P_{G_{d-1} G_d} \\ P_{G_dG_1}& 0 & \cdots & 0 & 0 \end{pmatrix}

where P_{G_l G_{l+1}} describe the transition between states in G_l and G_{l+1}.

Putting the matrix P to the power d we find the block diagonal form P^d\,=\, \begin{matrix} G_1 \\ G_2 \\ \vdots\\ G_d \end{matrix} \begin{pmatrix} Q_{G_1} & 0 & 0 & \cdots &0 \\ 0 & Q_{G_2} & 0 & \cdots & 0 \\ \vdots & \vdots & & & \vdots \\ 0 & 0 & \cdots & 0 & Q_{G_d} \end{pmatrix}

Theorem 6.4 If X_n is an irreducible Markov chain on a finite state space with stationary distribution \pi then we have \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} P^k(i,j) \,=\, \pi(j) \,.

Proof. The matrix Q_{G_l}, restricted to the state space G_l, governs the evolution of an irreducible aperiodic Markov chain and thus, by Theorem 6.3, we have \lim_{n \to \infty} P^{nd}(i,j) \,=\,\lim_{n \to \infty} Q^n_{G_l}(i,j) = \pi_{G_l}(j)\,\quad \textrm{ for } i \in G_l, j \in G_l where \pi_{G_l} is the stationary distribution for Q_{G_l}. Moreover P^{m}(i,j)=0 if m is not of the form nd.

If i \in G_{l-1} and j \in G_{l} then \lim_{n \to \infty} P^{nd+1}(i,j) \,=\, \lim_{n\to \infty} \sum_{k \in G_{l}} P(i,k) P^{nd}(k,j) \,=\, \sum_{k \in G_{l}} P(i,k) \pi_{G_l}(j) \,=\, \pi_{G_l}(j)\,, and P^{m}(i,j)=0 if m\not=nd+1 for some n. Similarly if i \in G_{l-r} and j \in G_{l} we have \lim_{n \to \infty} P^{nd+r}(i,j) \,=\, \pi_{G_l}(j)\,. and P^{m}(i,j)=0 if m\not=nd+r for some n.

The sequence P^n(i,j) is asymptotically periodic: d-1 successive zeros alternate with a number converging to \pi_{G_l}(j). Thus if we define \pi \equiv \frac{1}{d} ( \pi_{G_1}, \cdots, \pi_{G_d}) then \pi is normalized, stationary and \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^n P^k(i,j) \,=\, \pi(j) since the fraction of time spent in state j is asymptotically equal to \frac{1}{d} \pi_{G_l}(j).

Hitting and return times

We now have a good understanding of the asymptotic behavior of P^n(i,j) or \mu P^n as n \to \infty. When simulating or observing a Markov chain we usually observe a single sequence X_0, X_1, X_2, \cdots and our next goal is to derive a law of large numbers for this sequence of random variables. To do this it will be useful to track how often the Markov chain visits some state.

Notations:

  • Hitting time for the state j, \sigma(j), is the first time the Markov chain visits the state j. \sigma(j) = \inf\{ n \ge 0\,;\, X_n=j \} \quad \textrm{hitting time }
  • Return time for the state j, \tau(j), is the first time the Markov chain returns to the state j. \tau(j) = \inf\{ n \ge 1\,;\, X_n=j \} \quad \textrm{return time }

Hitting times and return times are closely related and differ only in how they treat what happens at time 0 (both are useful later on). They are examples of stopping times and have the crucial property that the event \{\tau(j)=k\} depends only on X_0, \cdots, X_k. In particular we have the strong Markov property P( X_{\tau(j)+1}=l| \tau(j)=k, X_0=i_0, \cdots, X_k=i_k) = P( X_{k+1}=l| X_k=j) = P(j,l) which means that the Markov chain starts afresh after the random return time.

Irreducibility means that starting from i the Markov chain will reach j and thus for any i we have P\{\tau(j) <\infty |X_0=i\} >0 If the state space is finite then we have something much stronger. Not only do we have P\{\tau(j) <\infty |X_0=i\} =1, but the expectation of \tau(j) is actually finite.

Lemma 6.2 For an irreducible Markov chain on the finite state space S the expected hitting time is finite: E[\sigma(j)|X_0=i] < \infty for all i,j \in S.

Proof. By irreducibility the Markov chain can reach the state j from any state i in a finite time. Since S is finite this time is uniformly bounded. So there exist a time m and an \varepsilon>0 such that P^l(i,j)\ge \varepsilon for some l \le m, uniformly in i \in S. Using the strong Markov property this implies that P\{\sigma(j) > k m|X_0=i\} \le (1-\varepsilon) P \{ \sigma(j) > (k-1) m |X_0=i\}\le \cdots \le (1-\varepsilon)^k

Therefore, using that P\{\sigma(j) > n|X_0=i\} is a decreasing function of n E[\sigma(j)|X_0=i] = \sum_{n \ge 0} P\{\sigma(j) > n|X_0=i\} \le m \sum_{k \ge 0} P\{\sigma(j) > km|X_0=i\}\le m \sum_{k \ge 0} (1-\varepsilon)^k < \infty \quad \blacksquare

  • Note that if S is infinite this is not necessarily true and this will lead to the concepts of transience, recurrence and positive recurrence.

Strong law of large numbers for Markov chains

We will also need the random variable Y_n(j) which counts the number of visits to state j up to time n, that is Y_n(j) \equiv \sum_{k=0}^{n-1} \mathbf{1}_{\{X_k=j\}} \quad = \textrm{ the number of visits to the state } j \textrm{ up to time } n

Since E \left[ \mathbf{1}_{\{X_k=j\}} \vert X_0=i \right] =P^k(i,j) and we have \frac{1}{n}E[Y_n(j)\vert X_0=i]= \frac{1}{n}\sum_{k=0}^{n-1} P^k(i,j) and thus by Theorem 6.3 or Theorem 6.4 \pi(j) \,=\, \lim_{n\to \infty} \frac{1}{n}E\left[ \sum_{k=1}^{n} {\bf 1}_{\{X_k=j\}} \right] \quad = \textrm{ the proportion of time spent in state } j \textrm{ in the long run } This is an important intuition

\pi(j) = \textrm{expected fraction of time that the Markov chain spends in state } j \textrm{ in the long run}. The next results strengthen that interpretation.

Theorem 6.5 (Ergodic Theorem for Markov chains) Let X_n be an irreducible aperiodic Markov chain with arbitrary initial condition \mu, then, with probability 1 we have \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} {\bf 1}_{\{X_k=j\}} \,=\, \pi(j) \,, this is the ergodic theorem for Markov chains.

Moreover if \tau(j) is the first return time to j we have Kac’s formula \pi(j) \,=\, \frac{1}{E[\tau(j)|X_0=j]} \,.

Proof. For a Markov chain starting in some state i consider the successive returns to the state j. By the strong Markov property, once the Markov chain has reached j it starts afresh. So the time when the Markov chain returns to state j for the k^{th} time is T_k(j)\,=\, \tau_1(j) + \cdots+ \tau_k(j) \,, where the \tau_l(j) are independent copies of the return time \tau(j). For l\ge 2, \tau_l(j) is conditioned on starting at j while for l=1 it depends on the initial condition. Note that by the strong LLN for IID random variables we have \lim_{k\to \infty}\frac{T_k(j)}{k}\,=\, \lim_{k \to \infty} \frac{1}{k} \left( \tau_1(j) + \cdots+ \tau_k(j) \right) \,=\, E[\tau(j)|X_0=j] \,.

Note further that if the total number of visits to state j up to time n is k, it means that we have returned exactly k times and so \left\{ Y_n(j)=k \right\} =\left\{ T_{k}(j) \le n < T_{k+1}(j) \right\} So we have \frac{ T_{Y_n(j)}(j) } {Y_n(j)} < \frac{n}{Y_n(j)} \le \frac{T_{Y_n(j)+1}(j)}{Y_n(j)+1} \frac{Y_n(j)+1}{ Y_n(j)} As n \to \infty, Y_n(j) \to \infty too and taking n \to \infty both extremes of the inequality converge to E[\tau(j) \vert X_0=j] with probability 1 and thus we conclude that, with probability 1, \lim_{n \to \infty} \frac{Y_n(j)}{n} \,=\, \frac{1}{E[\tau(j)|X_0=j]} \,. On the other hand we know that \lim_{n \to \infty} \frac{E[Y_n(j)]}{n} \,=\, \pi(j) \,, and thus \pi(j) \,=\, \frac{1}{E[\tau(j)|X_0=j]} \,. Putting all the pieces together gives the theorem. \quad \blacksquare.
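
Both the ergodic theorem and Kac's formula are easy to check on a simulated trajectory; here is a short sketch (the 3-state chain is an arbitrary choice of ours).

# Checking pi(j) ~ fraction of time spent in j ~ 1 / E[return time to j] on one long trajectory
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.2, 0.8, 0.0],
              [0.3, 0.3, 0.4],
              [0.5, 0.0, 0.5]])
n_steps, j = 200000, 0

x, visits, returns, last = j, 0, [], 0           # start at X_0 = j
for n in range(1, n_steps + 1):
    x = rng.choice(3, p=P[x])
    if x == j:
        visits += 1
        returns.append(n - last)                 # time elapsed since the previous visit to j
        last = n

print(visits / n_steps)                          # empirical fraction of time spent in state j
print(1.0 / np.mean(returns))                    # Kac: 1 / (empirical mean return time to j)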

7 Transient behavior of Markov chains

Decomposition of state space

We drop the assumption of irreducibility and develop a number of tools to study the transient behavior of Markov chains.

  • The communication relation i \leftrightsquigarrow j is an equivalence relation. We use the convention P^0 = I (the identity matrix) so that i \leftrightsquigarrow i (every state communicates with itself since P^0(i,i)=1).
    1. It is reflexive : i \leftrightsquigarrow i.

    2. It is symmetric: i \leftrightsquigarrow j implies j \leftrightsquigarrow i.

    3. It is transitive: i \leftrightsquigarrow j and j \leftrightsquigarrow l implies i \leftrightsquigarrow l.

  • Using this equivalence relation we can decompose the state space S into mutually disjoint communication classes S = C_1 \cup C_2 \cup \cdots \cup C_M \,.

There are two kinds of communication classes, transient classes and closed classes:

  • A class C is called transient if there exists i\in C and j \in S \setminus C with i \rightsquigarrow j.

    This means it is possible to exit the class C and never come back: if the chain could come back we would have j \rightsquigarrow k for some k \in C and thus j \in C, a contradiction.

  • A class C is called closed if it is not transient, that is, for any pair i\in C and j \in S \setminus C we have i \not\rightsquigarrow j.

    Clearly it is impossible to exit a closed class.

The next result states that if you start in a finite transient class you will eventually exit it with probability 1 (this may no longer be true if S is infinite).

Lemma 7.1 Suppose S is finite and X_0 \in C where C is a transient class. Then X_n exits C after a finite time with probability 1. As a consequence for i, j \in C we have \lim_{n\to \infty} P^n(i,j) =0 \,.

Proof. Since C is a transient class, X_n can exit C starting from any i\in C after a finite number of steps. Since C is finite, there exist
k and \theta < 1 such that P \{X _k \in C | X_0=i \} \le \theta \,\, \textrm{ for all } i \in C \,. Using the (strong) Markov property this implies that P \{X _{nk} \in C | X_0=i \} \le \theta^n and so the probability to stay in the transient class goes to 0 as time goes by. If i and j both belong to the transient class C this can be re-expressed as \lim_{n\to \infty} P^n(i, j) \,=\, 0 \,.

\quad \blacksquare.

Communication diagram

To figure out the class structure it is convenient to build a directed graph where the vertices are the states and each directed edge corresponds to a pair (i,j) with P(i,j)>0.

For example

P= \begin{matrix} 1 \\ 2 \\ 3\\4\\5\\6\\7\\8 \end{matrix} \begin{pmatrix} 0 & .4 & .6 & 0 & 0 & 0 & 0 &0 \\ .3 & .7 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & .5 & 0 & 0 & 0 & .5\\ 0 & 0 & .2 & 0 & .8 & 0 & 0 &0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 &0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 &0 \\ 0 & 0 & 0 & 0 & 0 & .3 & 0 &.7 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 &0 \end{pmatrix}

There are 4 classes: T_1=\{1,2\}, T_2=\{3,4\}, T_3=\{5\} are all transient and C=\{6,7,8\} which is closed.
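
The class structure can also be found mechanically from the reachability relation; below is a short sketch of ours applied to the matrix above (it computes the transitive closure of the relation P(i,j)>0 and groups mutually reachable states).

# Communication classes from the reachability relation i ~> j
import numpy as np

P = np.array([
    [ 0, .4, .6,  0,  0,  0,  0,  0],
    [.3, .7,  0,  0,  0,  0,  0,  0],
    [ 0,  0,  0, .5,  0,  0,  0, .5],
    [ 0,  0, .2,  0, .8,  0,  0,  0],
    [ 0,  0,  0,  0,  0,  1,  0,  0],
    [ 0,  0,  0,  0,  0,  0,  1,  0],
    [ 0,  0,  0,  0,  0, .3,  0, .7],
    [ 0,  0,  0,  0,  0,  1,  0,  0]])

n = len(P)
reach = np.eye(n, dtype=bool) | (P > 0)        # P^0 = I gives i ~> i
for _ in range(n):                             # transitive closure of the reachability relation
    reach = reach | ((reach.astype(int) @ reach.astype(int)) > 0)

comm = reach & reach.T                         # i <~> j
classes = {frozenset(np.flatnonzero(comm[i])) for i in range(n)}
for c in sorted(classes, key=min):
    closed = not any(reach[i, j] and j not in c for i in c for j in range(n))
    print(sorted(s + 1 for s in c), "closed" if closed else "transient")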

Decomposition of reducible Markov chains

Consider a Markov chain with recurrent (closed) classes R_1, \cdots R_L and transient classes denoted by T_1, \cdots T_K (set T= T_1 \cup \cdots \cup T_K). After reordering the states the transition matrix has the block form P\,=\, \begin{matrix} R_1 \\ R_2 \\ R_3 \\ \vdots \\ R_L \\ T \end{matrix} \begin{pmatrix} P_1 & & & & &\\ & P_2 & & 0 & &\\ & & P_3 & & & \\ & 0 & & \ddots & &\\ & & & &P_L &\\ & & S & & & Q \end{pmatrix} \tag{7.1}

where P_l gives the transition probabilities within the class R_l, Q the transition within the transient classes and S\not= 0 the transition from the transient classes into the recurrent classes.

It is easy to see that P^n has the form P^n\,=\, \begin{matrix} R_1 \\ R_2 \\ R_3 \\ \vdots \\ R_L \\ T \end{matrix} \begin{pmatrix} P_1^n & & & & &\\ & P_2^n & & 0 & &\\ & & P_3^n & & & \\ & 0 & & \ddots & &\\ & & & &P_L^n &\\ & & S_n & & & Q^n \end{pmatrix} for some matrix S_n.

Example: Consider the random walk with absorbing boundary conditions (see Example 5.5). There are three classes, 2 closed ones
\{0\}, \{N\} and 1 transient \{1, \cdots, N-1\} for example with N=5 we have

P\,=\, \begin{matrix} 0 \\ 5 \\ 1 \\ 2 \\ 3 \\ 4 \end{matrix} \begin{pmatrix} 1 & 0 & 0 & 0 &0 & 0\\ 0 & 1 & 0 & 0 & 0& 0\\ 1/2 & 0 & 0 & 1/2 & 0 & 0 \\ 0 & 0 & 1/2 & 0 & 1/2 & 0 \\ 0 & 0 & 0 & 1/2 & 0 & 1/2\\ 0 & 1/2 & 0 & 0 & 1/2 & 0 \end{pmatrix} To get a feel of what’s going on we find (rounded to 4 digit precision)

P^{20} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0.7942 & 0.1953 & 0.004 & 0 & 0.0065 & 0 \\ 0.5924 & 0.3907 & 0 & 0.0104 & 0 & 0.0065 \\ 0.3907 & 0.5924 & 0.0065 & 0 & 0.0104 & 0 \\ 0.1953 & 0.7942 & 0 & 0.0065 & 0 & 0.004 \\ \end{pmatrix}\,, \quad P^{50}=\begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0.8 & 0.2 & 0 & 0 & 0 & 0 \\ 0.6 & 0.4 & 0 & 0 & 0 & 0 \\ 0.4 & 0.6 & 0 & 0 & 0 & 0 \\ 0.2 & 0.8 & 0 & 0 & 0 & 0 \\ \end{pmatrix}

As expected, starting in the transient class the Markov chain eventually exits it and reaches one of the two closed classes. From P^{50} one can read that starting in, say, state 2, with probability .6 the Markov chain will end up at 0 and with probability .4 it will end up at 5. We will learn how to compute these probabilities later on, but we first study how long it takes to exit the transient class.

Absorption time

  • Starting in a recurrent class R_l the Markov chain stays in R_l forever and its behavior is dictated entirely by the irreducible matrix P_l with state space R_l.

  • Starting from a transient class the Markov chain will eventually exit the class T and is absorbed into some recurrent class. One basic question is: What is the expected time until absorption?

  • We define the absorption time \tau_{\textrm{abs}} = \min\{ n \ge 0 \,;\, X_n \notin T \}

Theorem 7.1 (Expected time until absorption) Let j be a transient state and let \tau_{\textrm{abs}} be the time until the Markov chain reaches some closed class. Then we have E[\tau_{abs} |X_0=j] \,=\, \sum_{i\in T} M(j,i) \,. where M=(I-Q)^{-1} = I + Q + Q^2 + \cdots and Q is from the decomposition Equation 7.1.

Proof. From Lemma 7.1 we know that Q^n(i,j) \to 0 and so all eigenvalues of Q have absolute values strictly less than 1. Therefore I-Q is an invertible matrix and we can define M = (I-Q)^{-1} = I +Q +Q^2 + Q^3 + \cdots The second equality is the geometric series for matrices which follows from the identity (I+ Q + \cdots + Q^n)(I-Q) = I -Q^{n+1} \,. To give a probabilistic interpretation to the matrix M we introduce the random variable
Y(i) \,=\, \sum_{n=0}^\infty 1_{\{X_n=i\}} \quad = \textrm{ total number of visits to state } i If i is transient then by Lemma 7.1 Y(i) < \infty with probability 1. If j is another transient state we have E[ Y(i) \,\vert\, X_0=j]= \,E\left[ \sum_{n=0}^\infty {\bf 1}_{\{X_n=i\}} \,\vert\, X_0=j\right] \,=\,\sum_{n=0}^\infty P\left\{ X_n=i \,\vert\, X_0=j\right\} \,=\,\sum_{n=0}^\infty Q^n(j,i) = M(j,i) That is M(j,i) is simply the expected number of visits to i if X_0=j and thus, summing over all the transient states we obtain E[\tau_{abs} |X_0=j] \,=\, \sum_{i\in T} M(j,i) \,. \quad \blacksquare

Hitting time in irreducible Markov chain

  • Suppose X_n is irreducible and we are interested in computing the expected time to reach one state from another state that is E[ \tau(i)| X_0=j] \textrm{ for } j \not= i.

  • If i=j then we know from Theorem 6.5 that E[ \tau(i)| X_0=i]=\pi(i)^{-1} where \pi is the stationary distribution.

  • If i \not= j we use the following idea. Relabel the states such the first state is i and modify P such that i is an absorbing state
    P = \begin{pmatrix} P(i,i) & R \\ S & Q \end{pmatrix} \quad \quad \longrightarrow \quad \quad \widetilde{P} = \begin{pmatrix} 1 & 0 \\ S & Q \end{pmatrix} \tag{7.2}

The fact that P is irreducible implies that S\setminus \{i\} is a transient class for the modified transition matrix \widetilde{P}. The return time \tau(i) starting from j\not= i for the Markov chain with transition matrix P is exactly the same as the absorption time for the Markov chain with transition matrix \widetilde{P}. Indeed the hitting time does not depend on the submatrix R. Therefore from Theorem 7.1 we obtain immediately

Theorem 7.2 (Expected hitting time in irreducible Markov chain) Let X_n be an irreducible Markov chain. For i \not= j we have E[\tau(i) |X_0=j] \,=\, \sum_{l \in S \setminus \{i\}} M(j,l) \,. where M=(I-Q)^{-1} and Q is given in Equation 7.2 and is obtained by deleting the i^{th} row and i^{th} column from P.

Examples

Example (continued): Random walk with absorbing boundary conditions on \{0,1,\cdots,5\}. We get
Q\,=\, \begin{pmatrix} 0 & 1/2 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 \\ 0 & 1/2 & 0 & 1/2\\ 0 & 0 & 1/2 & 0 \end{pmatrix} \quad \implies \quad M=(I-Q)^{-1} \,=\, \begin{pmatrix} 1.6 & 1.2 & 0.8 & 0.4 \\ 1.2 & 2.4 & 1.6 & 0.8 \\ 0.8 & 1.6 & 2.4 & 1.2\\ 0.4 & .8 & 1.2 & 1.6 \end{pmatrix} and thus the expected times until absorption are 4 for states 1 and 4 and 6 for states 2 and 3.

Example: Random walk with reflecting boundary conditions (see Example 5.5) with N=5; we compute E[ \tau(0) | X_0=i]. The stationary distribution is \pi = \left(\frac{1}{10}, \frac{2}{10}, \frac{2}{10}, \frac{2}{10}, \frac{2}{10}, \frac{1}{10}\right) so E[ \tau(0) | X_0=0]=10. For i\not=0 we delete the first row and column from P and have Q\,=\, \left( \begin{array}{ccccc} 0 & 1/2 & 0 & 0 &0 \\ 1/2 & 0 & 1/2 & 0 &0 \\ 0 & 1/2 & 0 & 1/2 &0 \\ 0 & 0 & 1/2 & 0 &1/2 \\ 0 & 0 & 0 & 1 &0 \end{array} \right) \,, \quad M=(I-Q)^{-1} \,=\, \left( \begin{array}{ccccc} 2 & 2 & 2 & 2 & 1\\ 2 & 4 & 4 & 4 &2 \\ 2 & 4 & 6 & 6& 3\\ 2 & 4 & 6 & 8 & 4 \\ 2 & 4 & 6 & 8 & 5 \end{array} \right) and so the expected times to reach 0 starting from 0, 1, 2, 3, 4, 5 are 10, 9, 16, 21, 24, 25 respectively.
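
Both examples boil down to inverting I-Q; here is a brief numerical sketch for the absorbing case (the reflecting case is analogous).

# Expected absorption times for the random walk on {0,...,5} with absorbing ends
import numpy as np

Q = 0.5 * np.array([[0, 1, 0, 0],        # transitions among the transient states 1, 2, 3, 4
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 0]])
M = np.linalg.inv(np.eye(4) - Q)         # M(j,i) = expected number of visits to i starting from j
print(M)
print(M.sum(axis=1))                     # expected absorption times from 1, 2, 3, 4: [4, 6, 6, 4]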

Absorption probabilities

  • If X_0 = i \in T belongs to some transient class and if there is more than one closed class, say R_1, R_2, \cdots, R_L, then the Markov chain may be absorbed into different closed classes and so we wish to compute the probabilities P\{ X_n \,\,\textrm{ reaches class } R_l \,|\, X_0=i \}

  • Without loss of generality we can assume that each closed class is an absorbing state r_1, \cdots r_L (we can always collapse a closed class into an absorbing state since it does not matter which state in the recurrent class we first visit). We denote the transient states by t_1, \cdots, t_M and upon reordering the states the transition matrix has the form P\,=\, \begin{matrix} r_1 \\ \vdots \\ r_L \\ t_1 \\ \vdots \\ t_M \end{matrix} \begin{pmatrix} & & & & & \\ & I & & &0& \\ & & & & & \\ & & & & & \\ & S & & &Q & \\ & & & & & \\ \end{pmatrix} \tag{7.3}

  • We wish to compute A( t_i, r_l) = P\{ X_n \, \textrm{ reaches } r_l \,|\, X_0= t_i\}\,. and it will be convenient to set A(r_l, r_l)=1 and A(r_k, r_l)=0 if k\not= l.

Theorem 7.3 (Absorption probabilities in closed classes) For a Markov chain with transition matrix Equation 7.3 where the states t_1, \cdots t_M are transient we have P\{ X_n \, \textrm{ reaches } r_l \,|\, X_0= t_i\} = A(t_i, r_l) \quad \textrm{ with } \quad A = (I -Q)^{-1} S

Proof. We condition on the first step of the Markov chain to obtain \begin{aligned} A( t_i, r_l) &= P \left\{ X_n = r_l {\rm ~eventually~} \vert X_0= t_i\right\} \\ &= \sum_{k \in S} P \left\{ X_n = r_l \textrm{ eventually}, X_1=k \vert X_0= t_i\right\} \\ &= \sum_{ k \in S} P \left\{ X_1 = k \vert X_0= t_i\right\} P \left\{ X_n = r_l {\rm ~eventually~} \vert X_1= k\right\} \\ &= \sum_{ k \in S} P(t_i, k) A( k, r_l) \,=\, P(t_i, r_l) + \sum_{ t_k } P(t_i, t_k) A( t_k, r_l) \,. \end{aligned} If A is the M \times L matrix with entries A(t_i, r_l), then this can be written in matrix form as A \,=\, S + Q A or A \,=\, (I-Q)^{-1} S \,=\, MS \,.

Continuing with the Random walk with absorbing boundary conditions, we get A \,=\, MS \,=\, \left( \begin{array}{cccc} 1.6 & 1.2 & 0.8 & 0.4 \\ 1.2 & 2.4 & 1.6 & 0.8 \\ 0.8 & 1.6 & 2.4 & 1.2\\ 0.4 & .8 & 1.2 & 1.6 \end{array} \right) \left( \begin{array}{cc} 1/2 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 1/2 \end{array}\right) \,=\, \left( \begin{array}{cc} .8 & .2 \\ .6 & .4 \\ .4 & .6 \\ .2 & .8 \end{array}\right) For example from state 2 the probability to be absorbed in 0 is .6, and so on….
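
Continuing the same example numerically, here is a short sketch of the computation A=(I-Q)^{-1}S.

# Absorption probabilities A = (I - Q)^{-1} S for the absorbing random walk on {0,...,5}
import numpy as np

Q = 0.5 * np.array([[0, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 0]])
S = np.array([[0.5, 0.0],                # from state 1 the chain can be absorbed directly at 0
              [0.0, 0.0],
              [0.0, 0.0],
              [0.0, 0.5]])               # from state 4 the chain can be absorbed directly at 5
A = np.linalg.inv(np.eye(4) - Q) @ S
print(A)                                 # rows: starting states 1..4, columns: absorption at 0 and at 5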

Tennis game

  • We model tennis as a Markov chain: For each point player A will win with probability p and player B will win with probability q=1-p, and we assume all points are independent. Winning a game leads to the Markov chain with the following communication diagram.

Tennis communication diagram
  • The Markov chain “moves up” until it reaches one of the 5 states on the top (which we relabel 0,1,2,3,4) and the chain stays on those 5 states forever and performs a random walk with absorbing boundary conditions.

  • We want the probability that A wins, starting of course from the initial score 0-0. We compute this probability by successive conditioning.

  • Consider the events D_i=\{ X_n \textrm{ reaches the top row in state } i \}

  • We compute P(D_i) simply by enumerating all the paths leading up to state i starting from the initial state.

  • We find

P(D_0)=p^4 + 4 qp^4\,, \quad P(D_1)=4p^3q^2\,, \quad P(D_2)= 6p^2q^2\,, \quad P(D_3)= 4p^2q^3\,, \quad P(D_4)=q^4 + 4 pq^4 \quad

  • Conditioning on the D_i’s: \displaystyle P(A) =\sum_{i=0}^4 P(A|D_i) P(D_i)

  • Obviously we have P(A|D_0) =1 and P(A|D_4) =0. By conditioning on the first step we find the system of equations \begin{aligned} P(A|D_1) &= p + q P(A|D_2) & & P(A|D_1) = \frac{p (1-pq)}{1-2pq} \\ P(A |D_2) &= pP(A |D_1) + q P(A |D_3) & \implies \quad\quad & P(A|D_2) = \frac{p^2}{1-2pq}\\ P(A |D_3) &= p P(A|D_2) & & P(A|D_3) = \frac{p^3}{1-2pq} \end{aligned}

  • Putting everything together

P(A) = \frac{p^4 (1+2q)(1+4q^2)}{1-2pq}
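
A short numerical check of this closed form against the conditional decomposition above (the value of p is an arbitrary choice of ours):

# Checking P(A) = p^4 (1+2q)(1+4q^2) / (1-2pq) against the decomposition over D_0,...,D_4
p = 0.6
q = 1 - p
PD = [p**4 + 4*q*p**4, 4*p**3*q**2, 6*p**2*q**2, 4*p**2*q**3, q**4 + 4*p*q**4]
PA_given_D = [1.0,
              p*(1 - p*q)/(1 - 2*p*q),
              p**2/(1 - 2*p*q),
              p**3/(1 - 2*p*q),
              0.0]
print(sum(Pd * Pa for Pd, Pa in zip(PD, PA_given_D)))   # decomposition over the D_i
print(p**4 * (1 + 2*q) * (1 + 4*q**2) / (1 - 2*p*q))    # closed form, same value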

Gambler’s ruin

  • A gambler is coming to the casino to play the game of dice called craps. At this game the standard bet (pass line bet) has even money odds and the probability to win is \frac{244}{495}=0.49292929..., one of the casino games with best odds.

  • Suppose that the player starts with a fortune of j and bets 1 in every game. Their ultimate goal is to reach a fortune of N and, if they do, they stop. Of course they could lose all their money before reaching their goal.

  • The game is described by a random walk on \{0,1, \cdots, N\} with absorbing boundary conditions and we want to compute \alpha_j = P(X_n \textrm{ reaches } N \textrm{ before reaching } 0| X_0=j)

  • Conditioning on the first step gives the second order homogeneous linear difference equation \alpha_j = p \alpha_{j+1} + (1-p) \alpha_{j-1} \quad \textrm{ with boundary conditions } \alpha_0=0\,, \alpha_N=1

  • To solve this second order homogeneous linear difference equation we look for solutions of the form \alpha_j=x^j which leads to the equation px^2 - x + (1-p) =0 \quad \implies x_1=1\,, x_2= \frac{1-p}{p} \tag{7.4} This leads to the system of equations (for p\not= \frac12) \alpha_j= C_1 + C_2 \left(\frac{1-p}{p} \right)^j \quad \textrm{ with } \quad \left\{ \begin{array}{r} C_1+C_2=0 \\ C_1 + C_2\left(\frac{1-p}{p} \right)^N=1 \end{array} \right.

  • Solving gives the Gambler’s ruin probabilities
    \alpha_j = \frac{ 1 - \left( \frac{1-p}{p} \right)^j }{1 -\left( \frac{1-p}{p} \right)^N} \quad \quad \textrm{ Gambler's ruin for } q\not= p

  • If p=\frac{1}{2} the quadratic equation Equation 7.4 has a double root 1 but we note that \alpha_j= j is a second linearly independent solution of \frac{1}{2}\alpha_{j+1} - \alpha_j + \frac{1}{2}\alpha_{j-1}=0. Solving the equation \alpha_j= C_1 + C_2j with the boundary conditions gives \alpha_j \,=\, \frac{j}{N} \quad \quad \textrm{ Gambler's ruin for } p=\frac{1}{2}

  • How bad does it get? For example if p=\frac{244}{495} and you start with a fortune of 50 and want to double it, the probability to succeed is \frac{ 1 - \left( \frac{251}{244}\right)^{50}}{1 -\left(\frac{251}{244}\right)^{100}}=.1955, not so great. You could also be more risky and bet an amount of 10, in which case your probability to succeed is a much better \frac{ 1 - \left( \frac{251}{244}\right)^{5}}{1 -\left(\frac{251}{244}\right)^{10}}=.4647. Even better, bet everything and your probability to win is p=.4929. (In a casino boldness pays, or rather loses less.) These numbers are evaluated in the short sketch after this list.

  • Limiting cases To get a better handle on the gambler’s ruin formula we slightly rephrase the problem:

    • We start at 0.

    • We stop whenever we reach W (W is the desired gain) or when we reach -L (L is the acceptable loss).

    • The states are now j \in \left\{ -L, -L+1, \cdots, \cdots, W-1, W\right\}

    We get P(-L, W)\,\equiv\, P\left( {\rm Reach~}W{\rm~before~reaching~} - L {\rm ~starting~from~}0 \right) \,=\, \frac{1 - \left( \frac{1-p}{p} \right)^L}{1 - \left( \frac{1-p}{p} \right)^{L+W}} \,.

  • The limit L \to \infty describes a player with infinite wealth:
    P(-\infty,W)= \left\{\begin{array}{cl}1 & {\rm~if~} p> \frac12 \\ \left(\frac{p}{1-p}\right)^W & {\rm~if~} p< \frac12 \end{array} \right. Even with infinite wealth it is exponentially hard to win W!

  • The limit W \to \infty describes a player with fortune L who does not stop unless they lose everything. P(-L,\infty) \,=\, \left\{\begin{array}{cl} 1 - \left(\frac{1-p}{p}\right)^L & {\rm~if~} p > \frac{1}{2} \\ 0 & {\rm~if~} p < \frac{1}{2} \end{array} \right. For p<\frac12 the probability to play forever is, unsurprisingly, 0: the gambler is eventually ruined.
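
A short sketch of ours evaluating the gambler's ruin formula for the craps example mentioned above:

# Gambler's ruin: probability alpha_j to reach N before 0 starting from j, for p != 1/2
p = 244 / 495                                  # pass line bet at craps
r = (1 - p) / p
alpha = lambda j, N: (1 - r**j) / (1 - r**N)
print(alpha(50, 100))                          # betting 1 with fortune 50: ~0.1955
print(alpha(5, 10))                            # betting 10 with fortune 50: ~0.4647
print(p)                                       # betting everything: ~0.4929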

8 Markov chains with countable state space

Examples

We introduce some basic examples of Markov chains on a countable state space

Example 8.1 (Random walk on the non-negative integers) Let us consider a random walk on the set of nonnegative integers with partially reflecting boundary conditions at 0. The transition probabilities are given by \begin{matrix} 0 \\ 1 \\ 2 \\ \vdots \end{matrix} \begin{pmatrix} q & p & 0 &0 & 0 &\ldots \\ q & 0 & p &0 & 0 &\ldots \\ 0 & q & 0 & p & 0& \ldots \\ \vdots & \vdots & \ddots & \ddots & \ddots& \\ \end{pmatrix} \quad \quad \textrm{ with } q=1-p

Example 8.2 (Discrete-time queueing model) At a service station (think of a cash register), during each time period there is a probability p that an additional customer comes in the queue. The first person in the queue is being served and during each time period there is a probability q that this person exits the queue.

We denote by X_n the number of people in the queue (either in being served or waiting in line). The state space is S = \{ 0,1,2,3, \cdots \} and the transition probabilities are given by P\,=\, \begin{matrix} 0\\1\\2\\ \vdots \end{matrix} \begin{pmatrix} 1-p & p & 0 &0 & \ldots \\ q(1-p) & qp + (1-p)(1-q) & p(1-q) &0 & \ldots \\ 0 & q(1-p) & qp + (1-p)(1-q) & p(1-q) & \\ \vdots& \vdots & \ddots & \ddots & \ddots & \\ \end{pmatrix}

Example 8.3 (Repair shop) A repair shop is able to repair one item on any single day. On day n Z_n items break down and are brought for repair to the repair shop and we assume that Z_n are IID random variables with pdf P\{ Z_n = k\} = a_k for k=0,1,2, \cdots. If X_n denotes the number of item in the shop waiting to be repaired we have X_{n+1} = \max \{ (X_n -1), 0 \} + Z_n The state space is S = \{ 0,1,2,3, \cdots \} and the transition probabilities are P\,=\, \begin{matrix} 0\\1\\2 \\ \vdots \end{matrix} \begin{pmatrix} a_0 & a_1 & a_2 & a_3 & \cdots \\ a_0 & a_1 & a_2 & a_3 & \cdots \\ 0 & a_0 & a_1 & a_2 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \end{pmatrix} \quad \quad \textrm{ with } \sum_{k=0}^\infty a_k =1

Example 8.4 (Success run chain) Imagine a player taking a series of bets labelled 0,1,2, \cdots. The probability to win bet j is p_j. If the player wins bet j they move up to bet j+1 but if they lose they move back to bet 0. If X_n denotes the number of successive winning bets then X_n has state space S = \{ 0,1,2,3, \cdots \} and transition probabilities P\,=\, \begin{matrix} 0\\1\\2 \\ \vdots \end{matrix} \begin{pmatrix} q_0 & p_0 & 0 &0 & \ldots \\ q_1 & 0 & p_1 &0 & \ldots \\ q_2 & 0 & 0 & p_2 & \\ \vdots & \vdots & & \ddots & \ddots \\ \end{pmatrix} \quad \quad \textrm{ with } q_i=1-p_i This Markov chain allows for nice analytical computations.

Example 8.5 (Simple d-dimensional random walk) The state space S of the Markov chain is the d-dimensional lattice \mathbb{Z}^d. We denote by {\bf e}_i, i=1, \cdots, d the standard orthonormal basis in \mathbb{R}^d. We view \mathbb{Z}^d as the vertex set of a graph and any point {\bf x=(x_1, \cdots, x_d)} is connected by edges to 2d neighbors {\bf x} \pm {\bf e}_i. For the simple random walk we have p({\bf x}, {\bf x} \pm {\bf e}_i) \,=\, \frac{1}{2d} and all the others p({\bf x},{\bf y})=0.

Example 8.6 (Branching process) The branching process, also known as the Galton-Watson process, models the evolution over time of a population.
In a unit of time every individual in the population dies and leaves behind a random number of descendents.
To describe the Markov chain we use IID random variables Z_n^{(k)} indexed by n=0,1,2,\cdots and k=1,2,\cdots.
The branching process is given by X_{n+1} \,=\, \sum_{k=1}^{X_n} Z_n^{(k)} which simply says that each of the X_n individuals in the population at time n has a random number Z_n^{(k)} of descendents. It is not convenient to write down the transition probabilities but we will study this process later using its moment generating function.

Transience/recurrence dichotomy

  • In Markov chains on a countable state space a new phenomenon occurs compared to the finite state space case. Suppose the Markov chain is irreducible (or starts in a closed class); from a state i the Markov chain will return to i with positive probability, but it is also possible that the Markov chain does not return to i and “wanders away to infinity”.

We introduce the corresponding definitions of transience and recurrence of a state. Recall that the return time to state i is given by
\tau(i) = \min \{ n \ge 1 \,;\, X_n =i \}\,.

  • A state i is recurrent if the Markov chain starting in i will eventually return to i with probability 1, i.e. if P \left\{ \tau(i) < \infty |X_0=i \right\} = 1 .

  • A state i is transient if it is not recurrent, that is starting in i the Markov chain return to i with probability q<1, i.e., if P \left\{ \tau(i)< \infty |X_0=i \right\} = q < 1 .

Recall also the random variable Y(i) which counts the number of visits to state i: Y(i) = \sum_{k=0}^\infty \mathbf{1}_{\{X_k=i\}} \quad \textrm{ with expectation } E[Y(i)|X_0=j] = \sum_{n=0}^\infty P^n(j,i)

Theorem 8.1 (Transience/recurrence dichotomy)  

  1. A state i is recurrent \displaystyle \iff P\{ Y(i)=\infty | X_0=i \}=1 \iff \sum_{n=0}^\infty P^{n}(i,i) = \infty.
    Moreover if i is recurrent and i \leftrightsquigarrow j then j is recurrent and we have \sum_{n=0}^\infty P^n(i,j) = \infty and P \left\{ \tau(j) < \infty |X_0= i \right\} = 1

  2. The state i is transient \displaystyle \iff P\{ Y(i)<\infty | X_0=i \} =1 \iff \sum_{n=0}^\infty P^{n}(i,i) < \infty Moreover if i is transient and i \leftrightsquigarrow j then j is transient and we have \sum_{n=0}^\infty P^n(i,j) < \infty and P \left\{ \tau(j) < \infty |X_0= i \right\} < 1\,.

Proof. If i is recurrent the Markov chain starting from i will return to i with probability 1, and then by the Markov property will return a second time with probability 1 and therefore infinitely many time with probability 1. This means that Y(i)=\infty almost surely and so \sum_{k} P^{k}(i,i) = + \infty.

If i \leftrightsquigarrow j then we can find times l and m such that P^l(i,j)>0 and P^m(j,i)>0 and so \sum_{n=0}^\infty P^{n}(j,j) \ge \sum_{n=0}^\infty P^{n+l+m}(j,j) \ge P^m(j,i) \sum_{n=0}^\infty P^n(i,i) P^l(i,j) = \infty\,,

and j is recurrent. A similar argument shows that \sum_{n=0}^\infty P^n(i,j) = \infty.

It is a consequence of irreducibility that P \left\{\tau(i) < \tau(j) |X_0= j \right\} > 0. (Argue by contradiction: if this probability were 0 then, by the Markov property, the chain would never visit i starting from j). As a consequence 0\,=\, P \left\{ \tau(j) = \infty |X_0= j \right\} \,\ge\, P \left\{ \tau(i)< \tau(j) |X_0= j \right\} P \left\{ \tau(j) = \infty |X_0= i \right\} and therefore P \left\{ \tau(j) < \infty |X_0= i \right\} =1.

On the other hand, if i is transient, by the Markov property again, the random variable Y(i) is a geometric random variable: each visit to i is followed by a further return with probability q<1, which implies that E[Y(i)]=\sum_{k} P^{k}(i,i) = \frac{1}{1-q} <\infty. This implies all the other equivalences stated. \quad \blacksquare

Transience/recurrence for the simple random walk.

We analyze the recurrence and transience properties of the simple random walk on \mathbb{Z}^d (see Example 8.5).
As we will see this depends on the dimension. To prove recurrence or transience we compute/estimate directly \sum_{n=0}^\infty P^n(0,0) = \sum_{n=0}^\infty P^{2n}(0,0) since X_n is periodic with period 2.

  • d=1: To return to 0 in 2n steps the Markov chain must take exactly n steps to the left and n steps to the right and thus we have P^{2n}(0,0) = {2n \choose n} \frac{1}{2^{2n}} By Stirling’s formula we have n! \sim \sqrt{2 \pi n}e^{-n} n^n where a_n \sim b_n means that \lim a_n/b_n =1. Thus we have {2n \choose n} \frac{1}{2^{2n}} \sim \frac{1}{2^{2n}} \frac{ \sqrt{2 \pi 2n} e^{-2n} (2n)^{2n}} { 2 \pi n e^{-2n} n^{2n}} = \frac{1}{\sqrt{\pi n}}\,.
    Recalling that if a_n \sim b_n then \sum a_n converges if and only if \sum b_n converges, and since \sum \frac{1}{\sqrt{\pi n}} diverges, we see that the random walk in d=1 is recurrent.

  • d=2: In dimension 2 to return to 0 in 2n steps the Markov chain must take, for some 0\le k\le n, exactly k steps to the left, k steps to the right, n-k steps up and n-k steps down. Therefore P^{2n}(0,0) = \sum_{k=0}^n \frac{ (2n)!}{ k! k! (n-k)! (n-k)!} \frac{1}{4^{2n}} = \frac{1}{4^{2n}} {2n \choose n} \sum_{k=0}^n {n \choose k} {n \choose n-k} Now we claim that \sum_{k=0}^n {n \choose k} {n \choose n-k}= {2n \choose n} as can be seen by counting the number of ways that a team of n can be formed out of n boys and n girls. Therefore P^{2n}(0,0) = {2n \choose n}^2 \frac{1}{4^{2n}} \sim \frac{1}{\pi n} and thus the simple random walk is recurrent for d=2 since \sum\frac{1}{n} diverges.

  • d=3: Similarly as in dimension 2 we find P^{2n}(0,0) = \sum_{k,j : k+j \le n} \frac{ (2n)!}{ j! j! k! k! (n-k-j)! (n-k-j)!} \frac{1}{6^{2n}} = \frac{1}{2^{2n}} {2n \choose n} \sum_{k,j : k+j \le n} \left( \frac{1}{3^n} \frac{n!}{j! k! (n-j-k)!} \right)^2 To analyze this quantity we note that, by the multinomial theorem,
    \sum_{k,j : k+j \le n} \frac{1}{3^n} \frac{n!}{j! k! (n-j-k)!} =1

Moreover we have \sum_{i} q_i =1 \implies \sum_{i}q_i^2 \le \max_{i} q_i and thus we only need to find the maximum of \displaystyle \frac{n!}{j! k! (n-j-k)!}. If the maximum is attained at (j_0,k_0) then we must have for example \frac{n!}{(j_0-1)! k_0! (n-j_0-k_0+1)!} \le \frac{n!}{(j_0)! k_0! (n-j_0-k_0)!} \implies 2j_0 \le n -k_0 +1 \,. Repeating the same computation with j_0\to j_0+1, k_0\to k_0-1, k_0 \to k_0+1 gives the set of inequalities n- j_0 -1 \le 2 k_0 \le n- j_0 +1 \, \quad \textrm{ and } \quad n- k_0 -1 \le 2 j_0 \le n- k_0 +1 which implies that \frac n3 -1 \le j_0, k_0 \le \frac n3 +1 i.e. j_0 and k_0 are of order n/3. Using Stirling’s formula P^{2n}(0,0) \le \frac{1}{2^{2n}} {2n \choose n} \frac{1}{3^n} \frac{n!}{ (n/3)! (n/3)! (n/3)! } \sim \frac{3 \sqrt{3}}{2} \frac{1}{ (\pi n)^{3/2}} which shows, since \sum n^{-3/2} converges, that the random walk is transient in dimension 3.

Transience/recurrence for the success run chain

Continuing with Example 8.4 we consider the return time to state 0, \tau(0), whose p.d.f. we can compute explicitly since to return to 0 the only possible paths are 0 \rightarrow 0, 0 \rightarrow 1 \rightarrow 0, 0 \rightarrow 1 \rightarrow 2 \rightarrow 0, and so on. We set u_n \equiv p_0p_1 \cdots p_{n-1} (with u_0=1) and find P( \tau(0) = k|X_0=0) = p_0 p_1 \cdots p_{k-2}q_{k-1}= p_0 p_1 \cdots p_{k-2}(1-p_{k-1}) = u_{k-1} - u_k\,. Therefore P( \tau(0) \le n |X_0=0) = \sum_{k=1}^n P( \tau(0) = k|X_0=0) = (u_0-u_1) + (u_1-u_2) + \cdots + (u_{n-1} - u_n) = 1 - u_n and P( \tau(0) < \infty|X_0=0) = 1 - \lim_{n\to \infty} u_n and we obtain that \textrm{The success run chain is recurrent if and only if } \lim_{n \to \infty} u_n = \lim_{n \to \infty} p_0 \cdots p_{n-1} =0 To have a better handle on this criterion we need a little result from analysis about infinite products.

Lemma 8.1 With q_k=1-p_k and u_n= \prod_{k=0}^{n-1} p_k

\lim_{n \to \infty} u_n= \lim_{n\to \infty} \prod_{k=0}^{n-1} p_k = 0 \iff \sum_{k=0}^\infty {q_k} = \infty \lim_{n \to \infty} u_n= \lim_{n\to \infty} \prod_{k=0}^{n-1} p_k > 0 \iff \sum_{k=0}^\infty {q_k} < \infty

Proof. We have
\displaystyle \prod_{k=0}^{n-1} p_k > 0 \iff \infty > - \log \left(\prod_{k=0}^{n-1} p_k \right) = - \sum_{k=0}^{n-1} \log p_k = - \sum_{k=0}^{n-1} \log (1-q_k) Taking n \to \infty shows that \lim_{n\to \infty} \prod_{k=0}^{n-1} p_k > 0 \iff \sum_{k=0}^\infty \log (1-q_k) \textrm{ converges.} For this to happen we must have \lim_{k \to \infty} q_k =0. But since \lim_{x \to 0} \log(1-x)/x = -1 by L’Hospital’s rule, the series \sum_{k} \log (1-q_k) converges if and only if \sum_{k} q_k converges. \quad \blacksquare

For the chain to be transient q_k must go to 0 fast enough. If say q_i= \frac{1}{i} then the chain is recurrent while if q_i= \frac{1}{i^2} then it is transient.
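
A quick numerical illustration of this criterion (a sketch; we only define q_k for k \ge 2 so that q_k < 1, an assumption made purely for this illustration): u_n tends to 0 when q_k = 1/k but to a positive limit when q_k = 1/k^2.

# u_n = prod_k p_k with p_k = 1 - q_k: tends to 0 for q_k = 1/k (recurrent),
# but to a positive limit for q_k = 1/k^2 (transient).
import numpy as np

def u(n, q):
    # start the product at k = 2 so that q_k < 1 (illustrative convention)
    return np.prod([1.0 - q(k) for k in range(2, n)])

for n in [10, 100, 1000, 10000]:
    print(n, u(n, lambda k: 1 / k), u(n, lambda k: 1 / k**2))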

Another criterion for transience

We add one more method to establish transience. For this pick a reference state j and consider the hitting time to state j, \sigma(j) (not the return time, but we will use both), and set \alpha(i)\,=\, P \left\{ \sigma(j)< \infty | X_0= i \right\} By definition we have \alpha(j)=1 since \sigma(j)=0 if X_{0} = j. If the chain is transient we must have \alpha(i) < 1 \quad \textrm { for } i \not= j \,. since by Theorem 8.1 for i \not=j we have \alpha(i)= P \left\{ \tau(j) < \infty |X_0= i \right\} < 1.

Let us derive an equation for \alpha(i) by conditioning on the first step. For i \not=j \begin{aligned} \alpha(i) &= P \left( \sigma(j) < \infty | X_0= i \right) = P \left( \tau(j) < \infty | X_0= i \right) \\ &= \sum_{k \in S} P \left( \tau(j) < \infty | X_1=k \right) P(i,k) = \sum_{k \in S} P \left( \sigma(j)< \infty | X_0=k \right) P(i,k) \\ &= \sum_{k \in S} P(i,k) \alpha(k) \end{aligned} and thus \alpha(i) satisfies the equation P\alpha(i) = \alpha(i) \,, \quad i \not= j \,.

Theorem 8.2 (A criterion for transience) An irreducible Markov chain X_n is transient if and only if for some state j_0 there exists a solution for the equation P\alpha(i) = \alpha(i) \textrm { for } i \not= j_0 \tag{8.1} such that \alpha(j_0)=1 \quad \textrm { and } \quad 0 < \alpha(i) < 1 \quad \textrm { for } i \not= j_0 \tag{8.2}

Proof. We have already established the necessity. In order to show the sufficiency assume that we have found a solution for Equation 8.1 and Equation 8.2. Then for i \not= j_0 we have, using repeatedly the equation P\alpha(i)=\alpha(i) \begin{aligned} 1 &> \alpha(i) = P\alpha(i) = P(i, j_0) \alpha(j_0) + \sum_{j \not= j_0} P(i, j) \alpha(j) = P(i, j_0) + \sum_{j \not= j_0} P(i, j) P\alpha(j) \\ &= P(i, j_0) + \sum_{j \not= j_0} P(i, j) P(j,j_0) + \sum_{j \not= j_0, k \not= j_0} P(i, j) P(j,k) \alpha(k) \\ &= P(i, j_0) + \sum_{j \not= j_0} P(i, j) P(j,j_0) + \sum_{j,k \not= j_0} P(i, j) P(j,k) P(k,j_0) + \cdots \\ &= P(\tau(j_0)=1|X_0=i) + P(\tau(j_0)=2|X_0=i) + P(\tau(j_0)=3|X_0=i) + \cdots \\ &= P(\tau(j_0)< \infty|X_0=i)\,. \end{aligned} which establishes transience. \quad \blacksquare

Transience/recurrence for the RW on \{0,1,2,\cdots\}

  • Continuing with Example 8.1 we use Theorem 8.2. We pick 0 as the reference state and for j \not=0 solve the equation P\alpha(j) = P(j, j-1) \alpha(j-1) + P(j, j+1) \alpha(j+1) = (1-p) \alpha(j-1) + p \alpha(j+1) = \alpha(j) whose solution is (like for the Gambler’s ruin problem) \alpha(j) = \left\{ \begin{array}{cl} C_1 + C_2 \left( \frac{1-p}{p}\right)^j & \textrm{ if } p \not= \frac{1}{2} \\ C_1 + C_2 j & \textrm{ if } p = \frac{1}{2} \end{array} \right. \,. Using that \alpha(0)=1 we find \alpha(j) = \left\{ \begin{array}{cl} (1-C_2) + C_2 \left( \frac{1-p}{p}\right)^j & \textrm{ if } p \not= \frac{1}{2} \\ (1-C_2) + C_2 j & \textrm{ if } p = \frac{1}{2} \end{array} \right. \,. and we see the condition 0 < \alpha(i) < 1 is possible only if (1-p)/p < 1 (that is p > 1/2), and is achieved by choosing C_2=1. Thus we conclude \textrm{ The random walk on } \{0,1,2, \cdots\} \textrm{ is } \left\{ \begin{array}{l} \textrm{transient for } p> \frac{1}{2} \\ \textrm{recurrent for } p \le \frac{1}{2} \end{array} \right. \,.
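
Here is a small Monte-Carlo check of this conclusion (a sketch, with illustrative parameters): for p=0.7 the hitting probability \alpha(1)=P(\sigma(0)<\infty|X_0=1) should be close to (1-p)/p = 3/7 \approx 0.43. Paths are truncated after a fixed number of steps, which introduces only a negligible bias since unabsorbed paths drift to +\infty.

# Monte-Carlo estimate of alpha(1) = P(hit 0 | X_0 = 1) for the random walk
# on {0,1,2,...} with up-probability p = 0.7.
import numpy as np

rng = np.random.default_rng(0)
p, trials, max_steps = 0.7, 5000, 1000
hits = 0
for _ in range(trials):
    x = 1
    for _ in range(max_steps):          # truncate paths that escape to +infinity
        x += 1 if rng.random() < p else -1
        if x == 0:
            hits += 1
            break
print(hits / trials, (1 - p) / p)       # estimate vs. (1-p)/p = 3/7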

9 Positive recurrent Markov chains

Positive recurrence versus null recurrence

  • A finite state irreducible Markov chain is always recurrent, P(\tau(i) < \infty |X_0=j)=1, and we have proved Kac’s formula for the invariant distribution, \pi(i) = E[ \tau(i)|X_0=i]^{-1}; in particular the random variable \tau(i) has finite expectation.
    For a countable state space it is possible for a Markov chain to be recurrent but that \tau(i) does not have finite expectation. This motivates the following definitions.

  • A state i is positive recurrent if E[ \tau(i)|X_0=i] < \infty

  • A state i is null recurrent if it is recurrent but not positive recurrent

We now investigate the relation between recurrence and the existence of invariant measures. We first show that if one state j is positive recurrent then there exists a stationary distribution. The basic idea is to decompose any path of the Markov chain into successive visits to the state j. To build up our intuition, if a stationary distribution were to exist it should measure the amount of time spent in state i, and to measure this we introduce \mu(i) = E\left[ \sum_{n=0}^{\tau(j)-1} {\bf 1}_{\{X_n=i\}}| X_0=j \right] \quad = \textrm{expected number of visits to $i$ between two successive visits to $j$}.

Note that \mu(j)=1 since X_0=j, and if j is positive recurrent \sum_{i} \mu(i) = \sum_{i\in S} E\left[ \sum_{n=0}^{\tau(j)-1} {\bf 1}_{\{X_n=i\}}| X_0=j \right] \,=\, E\left[ \tau(j) \,|\, X_0=j \right]

and thus \pi(i)= \frac{\mu(i)}{E[ \tau(j)|X_0=j]} is a probability distribution.

Theorem 9.1 For a recurrent irreducible Markov chain X_n and a fixed state j, \mu(i) = E[ \sum_{n=0}^{\tau(j)-1} {\bf 1}_{\{X_n=i\}}| X_0=j ] is stationary in the sense that
\mu P = \mu If the state j is positive recurrent then \mu can be normalized to a stationary distribution \pi.

Proof. The chain visits j at time 0 and then only again at time \tau(j) and thus we have the two formulas \begin{aligned} \mu(i) &= E\left[ \sum_{n=0}^{\tau(j)-1} {\bf 1}_{\{X_n=i\}}| X_0=j \right] = \sum_{n=0}^\infty P( X_{n}=i, \tau(j)> n | X_0=j)\\ &= E\left[ \sum_{n=1}^{\tau(j)} {\bf 1}_{\{X_n=i\}}| X_0=j \right] = \sum_{n=1}^{\infty} P( X_n=i, \tau(j) \ge n | X_0=j ) \end{aligned}

We have then, by conditioning on the previous step, and using \mu(j)=1 \begin{aligned} \mu(i) &= \sum_{n=1}^{\infty} P( X_n=i, \tau(j)\ge n | X_0=j ) = P(j,i) + \sum_{n=2}^\infty P( X_n =i, \tau(j)\ge n | X_0=j) \\ &= P(j,i) +\sum_{k \in S, k\not=j} \sum_{n=2}^\infty P( X_n =i, X_{n-1}=k, \tau(j)\ge n | X_0=j) \\ &= P(j,i) +\sum_{k \in S, k\not=j} \sum_{n=2}^\infty P(k,i) P( X_{n-1}=k, \tau(j)\ge n | X_0=j) \\ &= P(j,i) +\sum_{k \in S, k\not=j} \sum_{n=2}^\infty P(k,i) P( X_{n-1}=k, \tau(j)> n-1 | X_0=j) \\ &= P(j,i) +\sum_{k \in S, k\not=j} \sum_{m=1}^\infty P(k,i) P( X_{m}=k, \tau(j)> m | X_0=j) \\ &= \mu(j) P(j,i) + \sum_{k\not=j} E\left[ \sum_{m=1}^{\tau(j)-1} {\bf 1}_{\{X_m=k\}}| X_0=j \right] P(k,i) \\ &= \mu(j) P(j,i) + \sum_{k\not=j} \mu(k) P(k,i) \,=\, \sum_{k} \mu(k) P(k,i) \end{aligned} which proves the invariance of \mu. If the chain is positive recurrent we have already seen that \mu is normalizable. \quad \blacksquare.

Stationarity and irreducibility implies positive recurrence

Theorem 9.2 Assume the irreducible Markov chain has a stationary distribution \pi(i); then \pi(i)>0 for any i and we have Kac’s formula \pi(i) = E \left[ \tau(i) | X_0=i \right]^{-1}. In particular all states are positive recurrent and the stationary distribution is unique.

Proof. Let us assume that \pi is invariant. We first show that the chain must be recurrent. If the chain were transient then we would have P^n(i,j) \to 0 as n \to \infty and so by dominated convergence \pi(i) = \sum_{j} \pi(j) P^n(j,i) \to 0 \textrm{ as } n \to \infty \,. which is impossible.

The fact that \pi(i)>0 for all i\in S is proved as in the finite state space case, see the argument in Theorem 6.2.

To prove positive recurrence we use a clever argument involving the time reversed chain (more on time reversal in Chapter 12 and in the exercises). Consider the Markov chain with transition matrix Q(i,j)=\frac{\pi(j) P(j,i)} {\pi(i)}. It is easy to verify that Q(i,j) is a transition matrix and that \pi is stationary for Q, \pi Q=\pi. We denote by Y_n the Markov chain with transition matrix Q, since \pi(i)>0 this Markov chain has the same communication structure as X_n and is irreducible since X_n is. By the previous argument Y_n must be recurrent.

Next we write \pi(i) E[ \tau(i) |X_0=i] \,=\, \pi(i) \sum_{n=1}^\infty P (\tau(i) \ge n |X_0=i) The event \{\tau(i) \ge n \} conditioned on \{ X_0=i\} corresponds to a sequence of states i_0=i, i_1, \cdots, i_{n-1}, i_n=j where i_1, \cdots, i_{n-1} cannot be equal to i and j can be any state. Using repeatedly the relation \pi(i)P(i,j)= \pi(j)Q(j,i) the probability of such an event can be written as \begin{aligned} \pi(i) P(i,i_1) \cdots P(i_{n-1}, i_n) &= \pi(j) Q(j,i_{n-1}) \cdots Q(i_1, i) \end{aligned} For the Markov chain Y_n this corresponds to a path starting in j and hitting i for the first time after exactly n steps. Therefore we find

\begin{aligned} \pi(i) E[ \tau(i) |X_0=i] = \pi(i) \sum_{n =1}^\infty P (\tau(i) \ge n |X_0=i) &= \sum_{j \in S} \pi(j)\sum_{n=1}^\infty P ( \tau(i) =n | Y_0 = j) \\ &= \sum_{j \in S} \pi(j) P( \tau(i) < \infty|Y_0 = j) =1 \,. \end{aligned} where for the last equality we have used the recurrence of the time reversed chain. This shows Kac’s formula which implies the uniqueness of the stationary distribution and that all states are positive recurrent. \quad \blacksquare

Ergodic theorem for countable Markov chains

Theorem 9.3 (Ergodic theorem for countable state space Markov chains)  

  1. If X_n is irreducible and positive recurrent then there exists a unique stationary distribution \pi and for any initial distribution \mu we have \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} {\bf 1}_{\{X_k=j\}} \,=\, \pi(j) \,, with probability 1. In particular \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} \mu P^k(j) \,=\, \pi(j). Moreover we have Kac’s formula \pi(j) =\frac{1}{E[ \tau(j) |X_0= j]}. Conversely if an irreducible Markov chain has a stationary distribution then it is positive recurrent.

Proof. We have actually already proved all of it. Positive recurrence implies the existence of the stationary distribution (Theorem 9.1) and Kac’s formula is from Theorem 9.2, which implies uniqueness of the stationary distribution. We can now repeat the proof of Theorem 6.5 to show that if X_0=i with i arbitrary we have \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} {\bf 1}_{\{X_k=j\}} = \pi(j) \,. \tag{9.1} The reader should verify that the proof of Theorem 6.5 only uses positive recurrence and not the finiteness of the state space. Taking now the expectation of Equation 9.1 and averaging over the initial condition we have \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} \mu P^k(j) \,=\, \pi(j). \quad \blacksquare

Convergence to equilibrium via coupling

We continue our theoretical consideration by proving that if the chain is aperiodic then the distribution of X_n converges to \pi(j).

Theorem 9.4 Suppose X_n is an irreducible positive recurrent aperiodic Markov chain. Then for any initial distribution \mu we have \lim_{n \to \infty} \mu P^n(j) \,=\, \pi(j) \,.

Proof. We will use a coupling argument: we take two independent copies X_n and Y_n of the Markov chain where X_n is starting in the initial distribution \mu while Y_n is starting in the stationary distribution \pi.

The idea is to consider the coupling time \sigma = \inf\{ n \ge 1 ; X_n = Y_n\} \,. At the (random) time \sigma, X_n and Y_n are in the same state and after that time X_n and Y_n must have the same distribution by the (strong) Markov property. But since Y_n is distributed according to \pi so must X_n be as well and thus, at the coupling time \sigma, X_n has reached stationarity.

Let us now consider the chain Z_n=(X_n,Y_n) with transition probabilities P( Z_{n+1} = (k,l) \,|\, Z_n=(i,j) ) = P(i,k) P(j,l) and stationary distribution \pi( i,j) = \pi(i) \pi(j). Since X_n and Y_n are aperiodic, given states i,j,k,l we can find n_0 such that for every n \ge n_0 we have P^n(i,k) > 0 and P^n(j,l)>0. This implies that Z_n is irreducible and thus, since a stationary measure exists, by Theorem 9.3 the chain Z_n is positive recurrent. Since the coupling time is the first time the Markov chain Z_n hits a state of the form (j,j), recurrence of Z_n implies that P( \sigma < \infty)=1 and thus P(\sigma >n ) \to 0.

To conclude \begin{aligned} \left| \mu P^n(j) - \pi(j)\right| & = \left| P( X_n=j) - P(Y_n =j) \right| \\ & \le \left| P( X_n=j , \sigma \le n ) - P(Y_n =j , \sigma\le n) \right| + \left| P( X_n=j , \sigma > n ) - P(Y_n =j, \sigma > n) \right| \\ & = \left| P( X_n=j , \sigma > n ) - P(Y_n =j, \sigma > n) \right| \\ & = \left| E[ ({\bf 1}_{\{X_n=j\}} - {\bf 1}_{\{Y_n=j\}}) {\bf 1}_{\{\sigma > n\}} ] \right| \\ & \le E[ {\bf 1}_{\{\sigma > n\}}] = P( \sigma > n) \rightarrow 0 \textrm{ as } n \to \infty \end{aligned} and this proves the convergence. \quad \blacksquare.

  • The idea of coupling used in Theorem 9.4 is an instance of a powerful idea. We can use couplings other than the one used here, and if we can bound the tail behavior of the coupling time then we can control the speed of convergence to the stationary distribution. We will exploit this idea later on.

Examples

  • Positive recurrence for the random walk on \{0,1,2, \cdots\}. Continuing with Example 8.1, we use Theorem 9.3 to establish positive recurrence by computing the stationary distribution. We find the equations \pi(0) (1-p) + \pi(1)(1-p) = \pi(0) \quad \textrm{ and } \quad \pi(j+1)(1-p) + \pi(j-1)p = \pi(j) \quad \textrm{ for }j \ge 1 The second equation has the general solution \pi(n)=C_1 + C_2 \left(\frac{p}{1-p}\right)^n \,\,\,(\textrm{if }p\not=\frac12) \quad \textrm{ and } \quad \pi(n)=C_1 + C_2n\,\,\, (\textrm{ if } p=\frac12) and the first equation gives \pi(1)=\frac{p}{1-p}\pi(0). For p=\frac{1}{2} we cannot find a normalized solution. For p < \frac{1}{2} we can choose C_1=0 and \pi(n) = \left(\frac{p}{1-p}\right)^n \pi(0) can be normalized to find \pi(n) = \frac{1-2p}{1-p} \left(\frac{p}{1-p}\right)^n. If p> \frac{1}{2} we already know the Markov chain is transient and thus \textrm{ The random walk on } \{0,1,2, \cdots\} \textrm{ is } \left\{ \begin{array}{l} \textrm{transient for } p> \frac{1}{2} \\ \textrm{null recurrent for } p = \frac{1}{2} \\ \textrm{positive recurrent for } p < \frac{1}{2} \\ \end{array} \right. \,.

  • Positive recurrence for the success run chain: Continuing with Example 8.4 we investigate positive recurrence by solving \pi P=\pi. This gives the equations \begin{aligned} \pi(0) &= \pi(0) q_0 + \pi(1)q_1 + \cdots = \sum_{n=0}^\infty \pi(n) q_n \\ \pi(1) &= \pi(0) p_0\,, \quad \pi(2) = \pi(1) p_1, \quad \cdots \end{aligned} From the second line we find \pi(n) = \pi(0) p_0 p_1 \cdots p_{n-1} and inserting into the first equation \pi(0) = \pi(0) \left[ (1-p_0) + p_0 (1-p_1) + p_0 p_1(1-p_2) + \cdots \right] = \pi(0) \left[ 1 - \lim_{n \to \infty} p_0 p_1 \cdots p_{n-1}\right] Recall that recurrence occurs provided \lim_{n\to \infty} \prod_{j=0}^{n-1} p_j = 0 and so if X_n is recurrent there exists a solution of \pi P=\pi and we can normalize \pi provided \sum_{n=1}^\infty \prod_{j=0}^{n-1} p_j < \infty Therefore we obtain \textrm{ The success run chain is } \left\{ \begin{array}{l} \textrm{transient if } \lim_{n} \prod_{j=0}^{n-1} p_j >0 \\ \textrm{recurrent if } \lim_{n} \prod_{j=0}^{n-1} p_j = 0 \\ \textrm{positive recurrent if } \sum_{n} \prod_{j=0}^{n-1} p_j < \infty \\ \end{array} \right. \,. (A simulation check of Kac’s formula in the constant case p_j = p is sketched below.)
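
Here is a small simulation sketch for the success run chain with constant success probability p_j = p (a special case chosen only for this check): then \pi(n) = (1-p)p^n, and Kac’s formula predicts E[\tau(0)|X_0=0] = 1/\pi(0) = 1/(1-p).

# Simulation of return times to 0 for the success run chain with constant p;
# the empirical mean should be close to 1/(1-p), in agreement with Kac's formula.
import numpy as np

rng = np.random.default_rng(1)
p, trials = 0.6, 50000
total_steps = 0
for _ in range(trials):
    state, steps = 0, 0
    while True:
        steps += 1
        state = state + 1 if rng.random() < p else 0
        if state == 0:
            break
    total_steps += steps
print(total_steps / trials, 1 / (1 - p))   # both should be close to 2.5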

10 Branching processes

Moment generating function

  • For a random variable X taking values in \{0,1,2,3,\cdots\} the generating function of X is the function \phi_X(s)=E[s^X], defined for s\ge 0. We have \phi_X(s) = E[s^X] = \sum_{k=0}^\infty s^k P\{X=k\} It is closely related to the moment generating function E[e^{tX}] but this parametrization is more useful here.
  • Elementary properties of \phi_X(s):

    1. \phi_X(s) is an increasing function of s with \phi_X(0)=P\{X=0\} and \phi_X(1)=1.

    2. Differentiating gives \phi_X'(s) = \sum_{k=1}^\infty k s^{k-1} P\{X=k\}\,, \quad \quad \phi_X''(s) =\sum_{k=1}^\infty k(k-1) s^{k-2} P\{X=k\} and thus \phi_X'(1)=E[X].
      If P\{X \ge 2\}>0 then \phi_X''(s)>0 for s>0 and \phi_X is strictly convex.

    3. If X_1, X_2, \cdots X_m are independent random variables then \phi_{X_1 + \cdots + X_m}(s) = \phi_{X_1}(s) \cdots \phi_{X_m}(s)

Branching process

  • In a branching process individuals reproduce independently of each other.
  • During one time period each individual dies and leaves k offspring with probability p(k) for k=0,1,2, \cdots. We denote by X_n the total population after n time periods.
  • 0 is an absorbing state and corresponds to the extinction of the population. We assume p(0)>0 so the absorbing state can be reached from other states.
  • The transition probabilities are not easy to write down explicitly. If X_n=k then X_{n+1} will be the sum of the offspring of the k individuals. That is P\{X_{n+1}=j | X_n=k \} = P\{ Y_1 + \cdots + Y_k =j\} where Y_1, \cdots, Y_k are IID random variables with p.d.f. p.
  • Mean population E[X_n] is easy to compute by conditioning. Denoting by \displaystyle \mu=E[Y_1]=\sum_{n=0}^\infty np(n) the mean number of offspring of an individual we have E[X_n|X_{n-1}=k] = E[Y_1 +\cdots + Y_k]=k\mu and thus E[X_n] = E[ E[X_{n}|X_{n-1}]] = \mu E[X_{n-1}] =\cdots = \mu^n E[X_0]

  • If \mu <1 then E[X_n] \to 0 and so the population goes extinct since P\{ X_n \ge 1\} = \sum_{k=1}^\infty P\{X_n=k\} \le \sum_{k=0}^\infty k P\{X_n=k\} =E[X_n] \to 0 and so \displaystyle \lim_{n \to \infty} P\{ X_n=0 \} = 1.
  • If \mu=1 then the population stays constant on average but it could go extinct with probability 1 nonetheless (that is P(X_n=0) goes to 1 but E[X_n] is not small). If \mu>1 then the population grows on average but it still could go extinct with some non-zero probability.
  • To avoid trivial cases we assume that p(0)>0 (the population can go extinct) and p(0)+p(1)< 1 (the population can grow). We define the extinction probability \begin{aligned} a_n(k) &= P\{ X_n=0 | X_0=k\} \\ a(k) & = \lim_{n \to \infty} P\{ X_n=0 | X_0=k\} = P \left\{ \textrm{population goes extinct} | X_0=k \right\} \end{aligned}

  • Since for the population to go extinct all branches must go extinct, by independence, we have a(k) = a(1)^k \quad \textrm{ so we set } a \equiv a(1) and we assume from now on that X_0=1. Note this formula is also correct for k=0 since a(0)=1.

  • Equation for the extinction probability a. By conditioning on the first step \begin{aligned} a & = P \left\{ \textrm{population goes extinct} | X_0=1 \right\} \\ & = \sum_{k=0}^\infty P \left\{ \textrm{population goes extinct} | X_1=k \right\}P\{X_1=k|X_0=1\} \\ & = \sum_{k=0}^\infty a(k) p(k) \\ & = \sum_{k=0}^\infty a^k p(k) = \phi_Z(a) \end{aligned} where \phi_Z(s) is the generating function for the offspring distribution Z. So we have \textrm{ The extinction probability solves the fixed point equation } \phi_Z(a)=a \,.

  • Generating function for the population X_n. If X_0=1, \phi_{X_0}(s)=s and \phi_{X_1}(s) = \phi_Z(s) since X_1 = Z is the number of descendants of the single initial individual. In addition for n\ge 2 we find \begin{aligned} \phi_{X_n}(s) &= \sum_{k=0}^\infty P \left\{ X_n= k\right\} s^k \\ &= \sum_{k=0}^\infty \left[ \sum_{j=0}^\infty P \left\{ X_n= k|X_1=j\right\}P\left\{ X_1=j\right\}\right] s^k \\ &= \sum_{j=0}^\infty p(j) \sum_{k=0}^\infty P \left\{ X_n= k|X_1=j\right\} s^k \end{aligned} Now note that X_n conditioned on X_1=j is the sum of j independent copies of X_{n-1}, each describing the progeny of one of the j individuals of the first generation. Therefore the generating function of X_n conditioned on X_1=j is the generating function of X_{n-1} raised to the j^{th} power. So we find \phi_{X_n}(s)= \sum_{j=0}^\infty p(j) [\phi_{X_{n-1}}(s)]^j = \phi_Z(\phi_{X_{n-1}}(s)) and thus we get \phi_{X_n}(s) = \phi_Z^n(s) = \phi_Z ( \phi_Z( \cdots (\phi_Z(s)))) \quad (\textrm{the $n$-fold composition}).

Using these computations we are ready to derive the main result

Theorem 10.1 Let Z be random variable describing the distribution of descendants of a single individual in a branching process and we assume p(0)>0 and p(0)+p(1)<1. Then if X_0=1 the extinction probability is the smallest root of the equation \phi_Z(a)=a.

  1. If \mu \le 1 then a=1 and the population eventually dies out.

  2. If \mu>1 then the extinction probability a<1 is the unique root of the equation \phi_Z(s)=s with 0 < s < 1.

Proof. We have already established that the extinction probability a is a root of \phi_Z(s)=s. But we also know that 1 is a root since \phi_Z(1)=1, and the slope of \phi_Z is equal to \mu at s=1. Since p(0)+p(1)<1, \phi_Z(s) is strictly convex and thus \phi_Z(s)=s has at most two roots. We have the following cases (illustrated in Figure 10.1 below).

  1. If \mu< 1 the equation \phi_Z(s)=s has two roots 1 and s>1 and the extinction probability is 1.

  2. If \mu= 1 the line s is a tangent to the curve \phi_Z(s) at s=1 and so \phi_Z(s)=s has a single root s=1 and the extinction probability is 1.

  3. Finally if \mu> 1 the equation \phi_Z(s)=s has two roots 1 and a<1. Since \phi_Z(0)=p(0)>0 the second root satisfies a>0. To show that the smallest root is the extinction probability note that we have
    a_n(1)=P\left\{X_n=0|X_0=1\right\}=\phi_Z^n(0)

By induction we show that a_n(1)\le a. True for n=0 since a_0(1)=0. Assuming that a_{n-1}(1)\le a we have
a_n(1)= P \left\{ X_n=0|X_0=1 \right\}= \phi_Z^n(0) = \phi_Z( \phi_Z^{n-1}(0)) =\phi_Z(a_{n-1}(1)) \le \phi_Z(a) =a where we used that \phi_Z is increasing. This shows that the smallest root is the extinction probability. \quad \blacksquare.

Figure 10.1: Extinction probabilities for branching processes

Examples

  • If p(0)=\frac14, p(1)=\frac14, p(2)=\frac12 \implies \phi(s)=\frac14 + \frac14 s + \frac12 s^2 then \mu=\frac{5}{4}>1 and solving \phi(a)=a gives a=1,\frac12 so the extinction probability is 1/2.

  • If p(0)=\frac14, p(1)=\frac12, p(2)=\frac14 \implies \phi(s)=\frac14 + \frac12 s + \frac14 s^2 then \mu=1 and solving \phi(a)=a gives a=1 so the extinction probability is 1.

  • If p(0)=\frac12, p(1)=\frac14, p(2)=\frac14 \implies \phi(s)=\frac12 + \frac14 s + \frac14 s^2 then \mu=\frac{3}{4}<1 and solving \phi(a)=a gives a=1,2 so the extinction probability is 1.

Remark: The theorem provides a numerical algorithm to find the extinction probability (useful in particular when \mu >1, so that a<1). Indeed we have shown that the sequence 0, \phi_Z(0), \phi_Z^2(0), \phi_Z^3(0), \cdots converges to a.

import numpy as np
# phi_Z(s): generating function of the offspring distribution
# p = (1/8, 3/8, 1/8, 1/8, 2/8) on {0, 1, 2, 3, 4}
def f(s):
    return (1/8) + (3/8)*s + (1/8)*s**2 + (1/8)*s**3 + (2/8)*s**4

s = 0 # initial condition
N = 20 # the number of iterations

for i in range(N):
    s = f(s)
    print(s)
0.125
0.17413330078125
0.19498016827133297
0.20415762609873456
0.20826713424631457
0.2101216307743601
0.21096146654108697
0.21134240863922932
0.21151532651001245
0.2115938436492296
0.21162950143053777
0.21164569616414142
0.2116530515720635
0.21165639233633957
0.21165790969301684
0.21165859887003846
0.21165891189175637
0.21165905406517702
0.2116591186398882
0.21165914796951835
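
The fixed point above can also be checked by direct simulation. The following sketch simulates the branching process with the same offspring distribution and compares the fraction of extinct populations with the value \approx 0.2117 found by iteration; populations exceeding a cap are declared to survive, a harmless approximation since their extinction probability is then astronomically small.

# Monte-Carlo estimate of the extinction probability for the offspring
# distribution p = (1/8, 3/8, 1/8, 1/8, 2/8) on {0, 1, 2, 3, 4}.
import numpy as np

rng = np.random.default_rng(2)
p = np.array([1/8, 3/8, 1/8, 1/8, 2/8])
trials, max_generations, cap = 2000, 30, 10**4
extinct = 0
for _ in range(trials):
    x = 1
    for _ in range(max_generations):
        if x == 0 or x > cap:           # extinct, or clearly surviving
            break
        x = rng.choice(5, size=x, p=p).sum()
    if x == 0:
        extinct += 1
print(extinct / trials)                  # should be close to 0.2117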

Repair shop example

Recall the Markov chain in Example 8.3. To establish if the system is positive recurrent we try to solve \pi P=\pi and find the system of equations \begin{aligned} \pi(0)&= \pi(0)a_0 + \pi(1)a_0 \\ \pi(1)&= \pi(0)a_1 + \pi(1) a_1 + \pi(2)a_0 \\ \pi(2)& = \pi(0)a_2 + \pi(1)a_2 +\pi(2) a_1 + \pi(3)a_0 \\ & \vdots \\ \pi(n)& = \pi(0)a_n + \sum_{j=1}^{n+1} \pi(j) a_{n+1-j} \end{aligned}

We solve this using the generating functions \psi(s)=\sum_{n=0}^\infty s^n \pi(n) and \phi(s)=\sum_{k=0}^\infty s^k a_k: \begin{aligned} \psi(s) = \sum_{n=0}^\infty s^n\pi(n) &= \pi(0) \sum_{n=0}^\infty s^n a_n + \sum_{n=0}^\infty s^n \sum_{j=1}^{n+1} \pi(j) a_{n+1-j}\\ & = \pi(0) \sum_{n=0}^\infty s^n a_n + s^{-1} \sum_{j=1}^\infty \pi(j) s^j \sum_{n=j-1}^\infty s^{n+1-j} a_{n+1-j} \\ &=\pi(0) \phi(s) + s^{-1}(\psi(s)-\pi(0)) \phi(s) \end{aligned}

Solving for \psi(s) we find, after some algebra, the equation \psi(s) = \frac{\pi(0) \phi(s)}{1- \frac{1- \phi(s)}{1-s}}

To see if we can find \pi(0) such that this equation can be solved we take s \to 1. We have \psi(1)=\phi(1)=1 and \displaystyle \lim_{s\to 1}\frac{1- \phi(s)}{1-s}=\phi'(1)=\sum_{k=0}^\infty k a_k=\mu which is the mean number of objects arriving in the repair shop in a single day. We find the equation 1= \frac{\pi(0)}{1- \mu} and so we can find a solution 0 < \pi(0) \le 1 if and only if \mu< 1, and then \pi(0)=1-\mu. \textrm{ The repair shop Markov chain is positive recurrent iff } \mu < 1

To study transience we use Theorem 8.2 and try to find a solution of P \alpha(j) =\alpha(j) for j\ge 1 with \alpha(0)=1 and 0 < \alpha(j) <1 for j\ge 1. We find the system of equations \begin{aligned} \alpha(1) &= a_0 \alpha(0) + a_1 \alpha(1) + a_2 \alpha(2) + \cdots \\ \alpha(2) &= a_0 \alpha(1) + a_1 \alpha(2) + a_2 \alpha(3) + \cdots \\ \alpha(3) &= a_0 \alpha(2) + a_1 \alpha(3) + a_2 \alpha(4) + \cdots \\ & \vdots \\ \alpha(n) & = \sum_{j=0}^\infty a_j \alpha(j+n-1) \end{aligned} We try for a solution of the form \alpha(j)=s^j which gives s^n = \sum_{j=0}^\infty a_j s^{j+n-1} = s^{n-1} \phi(s) \implies \phi(s)=s and from our analysis of branching processes we know that a solution with s<1 exists iff \mu= \sum_{k=0}^\infty k a_k >1. So we get \textrm{ The repair shop Markov chain is transient iff } \mu > 1 and it is null-recurrent if \mu=1.
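
Here is a numerical check of these conclusions (a sketch; it assumes, consistently with the stationarity equations above, that the repair shop chain is X_{n+1} = \max(X_n-1,0) + A_{n+1} with i.i.d. arrivals A_n of distribution a_k): with Poisson(0.7) arrivals we have \mu = 0.7 < 1 and, by the ergodic theorem, the long run fraction of time spent in state 0 should approach \pi(0) = 1-\mu = 0.3.

# Long-run fraction of time the repair shop chain spends in state 0,
# with Poisson(mu) arrivals; it should approach pi(0) = 1 - mu.
import numpy as np

rng = np.random.default_rng(3)
mu, steps = 0.7, 10**6
arrivals = rng.poisson(mu, size=steps)
x, time_at_zero = 0, 0
for a in arrivals:
    x = max(x - 1, 0) + a
    if x == 0:
        time_at_zero += 1
print(time_at_zero / steps, 1 - mu)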

11 Reversible Markov chains

Balance equation

Consider a Markov chain with transition probabilities P(i,j) and stationary distribution \pi(i). We can rewrite the equation for stationarity, \pi(i)=\sum_{j} \pi(j) P(j,i) as
\sum_{j} \pi(i) P(i,j) \,=\, \sum_{j} \pi(j) P(j,i) \,. \tag{11.1} which we are going to interpret as a balance equation.

We introduce the (stationary) probability current from i to j as J(i,j) \equiv \pi(i) P(i,j) \tag{11.2} and Equation 11.1 can be rewritten as \underbrace{\sum_{ j} J(i,j)}_{\textrm{total current out of state } i} \,=\, \underbrace{\sum_j J(j,i) }_{\textrm{total current into state } i} \, \quad \quad \textrm{ balance equation } \tag{11.3} i.e., to be stationary the total probability current out of i must be equal to the total probability current into i.

Detailed balance

A stronger condition for stationarity can be expressed in terms of the balance between the currents J(i,j). A Markov chain X_n satisfies detailed balance if there exists \pi(i)\ge 0 with \sum_{i} \pi(i)= 1 such that for all i,j we have \pi(i) P(i,j) \,=\, \pi(j) P(j,i) \quad \textrm{ or equivalently } \quad J(i,j) = J(j,i) \tag{11.4} This means that for every pair i,j the probability currents J(i,j) and J(j,i) balance each other.
Clearly Equation 11.4 is a stronger condition than Equation 11.3 and thus we have

Lemma 11.1 If the Markov chain satisfies detailed balance for a probability distribution \pi then \pi is a stationary distribution.

But detailed balance is in general a strictly stronger condition than stationarity. The property of detailed balance is often also called (time-)reversibility since we have the following result, which states that the probability of any sequence of states is the same as the probability of the time reversed sequence.

Theorem 11.1 (Time reversibility) Suppose the Markov chain X_n satisfies detailed balance and assume that the initial distribution is the stationary distribution \pi. Then for any sequence of states i_0, \cdots i_n we have P \left\{ X_{0}= i_0 \,, X_1=i_1\,, \cdots \, , X_n =i_n \right\} \,=\, P \left\{ X_{0}= i_n \,, X_1 = i_{n-1}\,, \cdots \, , X_n =i_0 \right\}

Proof. Using Equation 11.4 repeatedly we find \begin{aligned} P \left\{ X_{0}= i_0 \,, X_1=i_1\,, \cdots \, , X_n =i_n \right\} & \,=\, \pi(i_0) P(i_0, i_1) P(i_1, i_2) \cdots P(i_{n-1}, i_n) \\ & \,=\, P(i_1, i_0) \pi(i_1) P(i_1, i_2) \cdots P(i_{n-1}, i_n) \\ & \,=\, P(i_1, i_0) P(i_2, i_1) \pi(i_2) \cdots P(i_{n-1}, i_n) \\ & \,=\, \cdots \\ & \,=\, P(i_1, i_0) P(i_2, i_1) \cdots \pi(i_{n-1}) P(i_{n-1}, i_n) \\ &\ \,=\, P(i_1, i_0) P(i_2, i_1) \cdots P(i_n, i_{n-1}) \pi(i_n) \\ &\,=\,P \left\{ X_{0}= i_n \,, X_1 = i_{n-1}\,, \cdots \, , X_n =i_0 \right\} \end{aligned} \blacksquare

The next result is very easy and very useful.

Theorem 11.2 Suppose X_n is a Markov chain with finite state space S and with a symmetric transition matrix, i.e, P(i,j)=P(j,i). Then X_n satisfies detailed balance with \pi(j) = 1 /|S|, i.e., the stationary distribution is uniform on S.

Examples

  • Random walk on the hypercube \{0,1\}^m. The state space S is S= \{0,1\}^m \,=\, \left\{ \sigma = ( \sigma_1, \cdots, \sigma_m) \,;\, \sigma_i \in \{0,1\} \right\} To define the move of the random walk, just pick one coordinate j \in \{ 1, \cdots, m\} and flip the j^{th} coordinate, i.e., \sigma_j \to 1- \sigma_j. We have thus P(\sigma, \sigma') =\left\{ \begin{array}{cl} \frac{1}{m} & {\rm if~}\sigma {\rm~and~}\sigma' {\rm~differ~by~one~coordinate} \\ 0 &{\rm otherwise} \end{array} \right. Clearly P is symmetric and thus \pi(\sigma) \,=\, 1/ 2^m.

  • Random walk on the graph G=(E,V) has transition probabilities P(v, w) = \frac{1}{ {\rm deg}(v)} if v \sim w. This Markov chain satisfies detailed balance with the (unnormalized) measure \mu(v) = {\rm deg}(v). Indeed we have P(v,w) >0 \iff P(w,v)>0 and thus if P(v,w)>0 we have \mu(v) P(v, w) \,=\, {\rm deg}(v) \frac{1}{{\rm deg} (v)} = 1 \,=\, {\rm deg}(w) \frac{1}{{\rm deg} (w)}= \mu(w) P(w, v) \,. This is slightly easier to verify than the stationarity equation \pi P = \pi. After normalization we find \pi(v) = {\rm deg}(v)/2|E|.
    For example for the simple random walk on \{0, 1, \cdots, N\} with reflecting boundary conditions we find \pi \,=\, \left( \frac{1}{2N}, \frac{2}{2N}, \cdots, \frac{2}{2N}, \frac{1}{2N} \right).

  • Network Markov chain: The previous example can be generalized as follows. For a given graph G=(E,V) let us assign a positive weight c(e)>0 to each (undirected) edge e= \{v,w\}, that is we choose numbers c(v,w)=c(w,v) with c(v,w)=0 iff v and w are not connected by an edge. If the transition probabilities are given by P(v,w) \,=\, \frac{c(v,w)}{c(v)} \quad \textrm{ with } c(v) \,=\, \sum_{w} c(v, w) \,, then it is easy to verify that the Markov chain satisfies detailed balance with \pi(v) \,=\, \frac{c(v)}{c_G} \quad \textrm{ with } c_G \,=\, \sum_{v} c(v) \,.
  • Birth-Death Processes Markov chain: Let us consider a Markov chain on the state space S=\{0, \cdots, N\} (N could be infinite) whose transition probabilities have the following tridiagonal structure \begin{matrix} 0 \\1\\2\\3\\ \cdots \end{matrix} \begin{pmatrix} r_0 & p_0 & 0 & 0& \cdots \\ q_1 & r_1 & p_1 & 0 & \cdots \\ 0 & q_2 & r_2 & p_2 & \cdots \\ \vdots & &\ddots&\ddots & \ddots \end{pmatrix} This is called a birth and death process since the only possible transitions are to move up or down by one unit or to stay put. Random walks (see Example 5.5 and Example 8.1), discrete queueing models (Example 8.2), the Ehrenfest urn model Example 5.6 are special cases of birth and death processes.

Birth and death Markov chains always satisfy detailed balance. Indeed the only non trivial detailed balance conditions are \pi(j)P(j,j+1) = \pi(j+1) P(j+1,j) \implies \pi(j) p_j \,=\, \pi(j+1) q_{j+1} \,, \quad \textrm{ for } j=0, \cdots, N-1 \,. and this can be solved recursively. We obtain \begin{aligned} \pi(1) \,&=\, \pi(0) \frac{p_0}{q_1} \\ \pi(2) \,&=\, \pi(1) \frac{p_1}{q_2} \,=\, \pi(0) \frac{p_0 p_1 }{q_1 q_2} \\ & \vdots \\ \pi(N) \,&=\, \pi(0) \frac{p_0 p_1 \cdots p_{N-1}}{q_1 q_2 \cdots q_{N}} \end{aligned} and with normalization \pi(j) \,=\, \frac{ \prod_{k=1}^j \frac{p_{k-1}}{q_k}}{ \sum_{l=0}^N \prod_{k=1}^l \frac{p_{k-1}}{q_k}}. If N is infinite we need the sum in the denominator to be finite.

For example the Ehrenfest urn model of Example 5.6 has p_j \,=\, \frac{N-j}{N} \,, \quad q_j \,=\, \frac{j}{N} and thus we obtain \pi(j) \,=\, \pi(0) \frac{ \frac{N}{N} \frac{N-1}{N} \cdots \frac{N-(j-1)}{N}}{ \frac{1}{N} \frac{2}{N} \cdots \frac{j}{N}} \,=\, \pi(0) { N \choose j } \quad \textrm{ and the normalization } \sum_{j=0}^N {N \choose j} \,=\, 2^N \textrm{ gives } \pi(j) = {N \choose j} 2^{-N}.
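
A quick numerical check (a sketch) of the birth and death formula in the Ehrenfest case: the unnormalized weights \prod_{k=1}^j p_{k-1}/q_k should reproduce the binomial coefficients {N \choose j}.

# Unnormalized stationary weights of the Ehrenfest urn from the birth-death
# formula, compared with the binomial coefficients C(N, j).
from math import comb

N = 10
p = lambda j: (N - j) / N          # up-probability from state j
q = lambda j: j / N                # down-probability from state j

w = [1.0]
for j in range(1, N + 1):
    w.append(w[-1] * p(j - 1) / q(j))
print([round(x, 6) for x in w])
print([float(comb(N, j)) for j in range(N + 1)])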

12 Markov chain Monte-Carlo

MCMC

Suppose you are given a certain probability distribution \pi on a set S and your goal is to generate samples from this distribution. The Markov chain Monte-Carlo (MCMC) method consists in constructing an irreducible Markov chain X_n whose stationary distribution is \pi. Then to generate a sample from \pi one simply runs the Markov chain X_n long enough so that its distribution is close to the equilibrium distribution. It turns out that the detailed balance condition is a very useful tool to construct the Markov chain in this manner.

A-priori this method might seem an unduly complicated way to sample from \pi. Indeed why not simply simulate from \pi directly using one of the methods of the Simulation section?

To dispel this impression we will consider some concrete examples; for instance we will see that often one wants to generate a uniform distribution on a set S whose cardinality |S| might be very difficult to compute.

A large class of models are so-called “energy models” which are described by an explicit function f:S \to \mathbb{R} interpreted as the energy (or weight) of a state. Then one is interested in the stationary distribution \pi(i) = Z^{-1} e^{-f(i)} \quad \textrm { where } Z= \sum_{i}e^{-f(i)} \textrm{ is a normalization constant} As we will see computing the normalization constant Z can be very difficult and so one cannot directly simulate from \pi!

The measure \pi assigns the largest probabilities to the states i where f(i) is smallest, that is to the states with the smallest energy.
Often (e.g. in economics) the measure is written with a different sign convention \pi(i) = Z^{-1}e^{f(i)} which now favors the states with the highest values of f.

Proper q-coloring of a graph

For a graph G=(E,V) a proper q-coloring consists of assigning to each vertex v of the graph one of q colors subject to the constraint that if 2 vertices are linked by an edge they should have different colors. The set S' of all proper q-colorings is a subset of S \,=\, \left\{ 1, \cdots, q\right\}^V. We denote the elements of S by \sigma = \{ \sigma(v)\}_{v \in V} with \sigma(v) \in \{1, \cdots, q\}. The uniform distribution on all such proper colorings is \pi(\sigma) = 1/|S'| for all \sigma \in S'. Even for moderately complicated graphs it can be very difficult to compute |S'|.

Consider the following transition rules:
  1. Choose a vertex v at random and choose a color a at random.
  2. Set \sigma'(v) = a and \sigma'(w)=\sigma(w) for w \not=v.
  3. If \sigma' is a proper q-coloring then set X_{n+1}=\sigma'. Otherwise set X_{n+1}=\sigma.

The transition probabilities are then given by \begin{aligned} P(\sigma, \sigma') =\left\{ \begin{array}{cl} \frac{1}{q |V|} & \textrm{ if } \sigma \textrm{ and } \sigma' \textrm{ differ at exactly one vertex} \\ 0 & \textrm{ if }\sigma \textrm{ and } \sigma' \textrm{ differ at more than one vertex} \\ 1 - \sum_{\sigma' \not = \sigma} P(\sigma, \sigma') & \textrm{ if } \sigma'=\sigma \end{array} \right. \end{aligned} Note that P(\sigma, \sigma) is not known explicitly either but is also not used to run the algorithm. The cardinality |S'| is also not needed.

In order to check that the uniform distribution is stationary for this Markov chain, note that P(\sigma, \sigma') is a symmetric matrix. If one can change \sigma into \sigma' by changing one color then one can do the reverse transformation too.
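
Here is a minimal sketch of this chain on a small example (a 4-cycle with q=4 colors, both chosen only for illustration; for a cycle the chromatic polynomial gives (q-1)^4+(q-1)=84 proper colorings). Over a long run the chain should visit all proper colorings with roughly equal frequencies.

# Uniform sampling of proper q-colorings of a 4-cycle by the chain described above.
import numpy as np

rng = np.random.default_rng(4)
V, q = 4, 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # the 4-cycle

def is_proper(sigma):
    return all(sigma[v] != sigma[w] for v, w in edges)

sigma = [0, 1, 0, 1]                               # a proper coloring to start from
counts = {}
for _ in range(200000):
    v, a = rng.integers(V), rng.integers(q)        # random vertex, random color
    proposal = sigma.copy()
    proposal[v] = a
    if is_proper(proposal):                        # otherwise stay put
        sigma = proposal
    key = tuple(sigma)
    counts[key] = counts.get(key, 0) + 1

print(len(counts))                                 # should be 84
print(min(counts.values()), max(counts.values()))  # roughly equal frequencies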

Knapsack problem

This is a classical optimization problem. You own m books and the i^{th} book has weight w_i lb and is worth $ v_i. In your knapsack you can put at most a total of b pounds and you are looking to pack the most valuable knapsack possible.

To formulate the problem mathematically we introduce \begin{aligned} w \,&=\, (w_1, \cdots w_m) \in {\bf R}^m &\textrm{ weight vector} \\ v \,&=\, (v_1, \cdots v_m) \in {\bf R}^m &\textrm{ value vector} \\ \sigma\,&=\, (\sigma_1, \cdots \sigma_m) \in \{0,1\}^m &\textrm{ decision vector} \end{aligned} where we think that \sigma_i =1 if the i^{th} book is in the knapsack. The state space is S' \,=\, \left\{ \sigma \in \{0,1\}^{m} \,;\, \sigma \cdot w \le b \right\} and the optimization problem is \textrm{Maximize } v \cdot \sigma \textrm{ subject to the constraint } \sigma \in S'

As a first step we generate a random element of S' using the following simple algorithm. If X_n=\sigma then

  1. Choose j \in \{1, \cdots m\} at random.
  2. Set \sigma' \,=\, ( \sigma_1, \cdots, 1-\sigma_j, \cdots, \sigma_m).
  3. If \sigma' \in S', i.e., if \sigma' \cdot w \le b then let X_{n+1}=\sigma'. Otherwise X_{n+1}=\sigma.

In other words, choose a random book. If it is in the sack already remove it. If it is not in the sack add it provided you do not exceed the maximum weight. Note that the Markov chain X_n is irreducible, since each state communicates with the state \sigma=(0, \cdots, 0). It is aperiodic except in the uninteresting case where \sum_{i} w_i \le b. Finally the transition probabilities are symmetric and thus the uniform distribution on S' is the unique stationary distribution.

In the knapsack problem we want to maximize the function f(\sigma)=\sigma\cdot v on the state space. One possible algorithm would be to generate an uniform distribution on the state space and then to look for the maximum value of the function. But it would be a better idea to sample from a distribution which assign higher probabilities to the states which we are interested in, the ones with a high value of f.

Let S be the state space and let f \,:\, S \to \mathbb{R} be a function. It is convenient to introduce the probability distributions define for \beta>0 by \pi_\beta(i) \,=\, \frac{e^{\beta f(i)}}{Z_\beta} \, \quad \textrm{ with } \quad Z_\beta =\sum_{j \in S} e^{\beta f(j)} \,. Clearly \pi_\beta assign higher weights to the states i with bigger values of f(i). Let us define S^* \,=\, \left\{ i \in S \,;\, f(i) = f^* \equiv \max_{j \in S} f(j) \right\} \,. If \beta=0 then \pi_0 is simply the uniform distribution on S. For \beta \to \infty we have \lim_{\beta \to \infty} \pi_\beta(i) \,=\, \lim_{\beta \to \infty} \frac{ e^{\beta (f(i)- f^*)}} { |S^*| + \sum_{j \in S\setminus S^*} e^{\beta (f(j) - f^* )}} \,=\, \left\{ \begin{array}{cl} \frac{1}{|S^*|} & {\rm ~if~} i \in S^* \\ 0 & {\rm ~if~} i \notin S^* \\ \end{array} \right. \,, i.e., for large \beta \pi_\beta is concentrated on the global maxima of f.

Metropolis algorithm

A fairly general method to generate a distribution \pi on the state space S is given by the Metropolis algorithm. This algorithm assumes that you already know how to generate the uniform distribution on S by using a symmetric transition matrix Q.

Theorem 12.1 (Metropolis algorithm) Let \pi be a probability distribution on S with \pi(i)>0 for all i and let Q be a symmetric transition matrix. Consider the Markov chain with the following transition matrix (the Metropolis algorithm). If X_n=i

  1. Choose Y \in S according to Q, i.e., P\{ Y=j \,\vert\, X_n=i\} \,=\, Q(i,j)
  2. Define the acceptance probability \alpha(i,j) \,=\, \min \left \{ 1 \,, \frac{\pi(j)}{\pi(i)} \right\}
  3. Accept Y with probability \alpha(i,Y) by generating a random number U. If U \le \alpha(i,Y) then set X_{n+1}= Y (i.e., accept the move) and if U > \alpha(i,Y) then X_{n+1}= X_n (i.e., reject the move).

If Q is an irreducible transition probability matrix on S then the Metropolis algorithm defines an irreducible Markov chain on S which satisfies detailed balance with stationary distribution \pi.

Proof. Let P(i,j) be the transition probabilities for the Metropolis Markov chain. Then for i \not= j we have P(i,j) \,=\, Q(i,j) \alpha(i,j) \,=\, Q(i,j) \min \left\{ 1 \,,\, \frac{\pi(j)}{\pi(i)} \right\} \,.

Since we assume \pi(i)>0 for all states i, the acceptance probability \alpha never vanishes. Thus P(i,j)>0 whenever Q(i,j)>0, and P is irreducible if Q is itself irreducible.

In order to check the reversibility we note that \pi(i) P(i,j) \,=\, Q(i,j) \pi(i) \min \left\{ 1, \frac{\pi(j)}{\pi(i)} \right\} \,=\, Q(i,j) \min \left\{\pi(i)\,,\, \pi(j) \right\} and the r.h.s is symmetric in i,j and thus \pi(i) P(i,j) =\pi(j) P(j,i). \quad \blacksquare

  • Note that only the ratios \frac{\pi(j)}{\pi(i)} are needed to run the algorithm; in particular we do not need the normalization constant. This is a very important feature of the Metropolis algorithm.

  • We could have chosen another acceptance probability \alpha(i,j). By inspection of the proof it is enough to pick \alpha(i,j) such that \alpha(i,j) \le 1 and \pi(i) \alpha(i,j) = \pi(j) \alpha(j,i) is symmetric. Some such examples will be considered in the homework and the choice given in Theorem 12.1 is actually optimal in the sense it gives the highest acceptance probability.

  • The general case with a non-symmetric proposal matrix Q is called the Metropolis-Hastings algorithm. In this case the acceptance probability is chosen to be \alpha(i,j) = \min \left\{ 1, \frac{\pi(j)Q(j,i)}{\pi(i)Q(i,j)} \right\} and it is not difficult to check (see Homework) that the Metropolis-Hastings algorithm yields a reversible Markov chain with stationary distribution \pi.
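
To make the algorithm concrete, here is a minimal generic sketch (the energy function, state space and all parameters below are arbitrary choices made only for illustration): the target is \pi(i) \propto e^{-f(i)} on \{0,\dots,K\} and the symmetric proposal moves one step left or right.

# A generic Metropolis sampler for pi(i) ~ exp(-f(i)) on {0,...,K} with a
# symmetric nearest-neighbor proposal (moves outside {0,...,K} are rejected).
import numpy as np

rng = np.random.default_rng(5)
K = 20
f = lambda i: (i - 5)**2 / 10.0                 # hypothetical energy function
target = np.exp(-f(np.arange(K + 1)))
target /= target.sum()

x, counts = 0, np.zeros(K + 1)
for _ in range(200000):
    y = x + (1 if rng.random() < 0.5 else -1)   # symmetric proposal Q
    if 0 <= y <= K and rng.random() < min(1.0, np.exp(f(x) - f(y))):
        x = y                                   # accept with prob min(1, pi(y)/pi(x))
    counts[x] += 1

print(np.abs(counts / counts.sum() - target).max())  # small once the chain has mixed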

Uniform distribution on a graph

Suppose we are given a graph G=(E,V). Often in applications to computer networks or social networks the graph is not fully known. Only local information is available: given a vertex one knows the vertices it is connected to. Nonetheless you can run a random walk on the graph by choosing an edge at random and moving to the corresponding vertex, i.e.  Q(v,w) = \frac{1}{\textrm{ deg}(v)} \textrm{ if } v \sim w and the stationary distribution is \pi(v) \propto {\rm deg}(v).

Suppose you wish to generate a uniform distribution on the graph. Then we can use the Metropolis-Hastings algorithm to generate a Markov chain with a uniform stationary distribution. We have \alpha(v,w) = \min\left\{ 1, \frac{Q(w,v)}{Q(v,w)}\right\} =\min\left\{ 1, \frac{\textrm{ deg}(v)}{\textrm{ deg}(w)}\right\} \textrm{ if } v \sim w and thus the transition matrix P(v,w) = Q(v,w) \alpha(v,w) = \frac{1}{\textrm{ deg}(v)} \min\left\{ 1, \frac{\textrm{ deg}(v)}{\textrm{ deg}(w)}\right\} = \min\left\{ \frac{1}{\textrm{ deg}(v)}, \frac{1}{\textrm{ deg}(w)} \right\} generates a uniform distribution on a graph.

Metropolis algorithm for the knapsack problem

Consider the distribution \pi_\beta(\sigma) \,=\, \frac{e^{\beta v \cdot \sigma}}{ Z_\beta} where the normalization constant Z_\beta = \sum_{\sigma \in S'} e^{\beta v \cdot \sigma} is almost always impossible to compute in practice. However the ratio \frac{\pi(\sigma')}{\pi(\sigma)} = e^{\beta v \cdot (\sigma'-\sigma)} does not involve Z_\beta and is easy to compute.

For this distribution we take as proposal matrix Q the one used to generate a uniform distribution on the allowed states of the knapsack (see the Knapsack problem above) and the Metropolis algorithm now reads as follows. If X_n=\sigma then

  1. Choose j \in \{1, \cdots m\} at random.
  2. Set \sigma' \,=\, ( \sigma_1, \cdots, 1-\sigma_j, \cdots, \sigma_m).
  3. If \sigma'\cdot w >b (i.e. if \sigma' \notin S') then X_{n+1}=\sigma.
  4. If \sigma'\cdot w \le b then let \displaystyle \alpha \,=\, \min \left\{ 1 , \frac{ \pi(\sigma')}{\pi(\sigma)} \right\} \,=\, \min \left\{1, e^{\beta v \cdot (\sigma' - \sigma)} \right\} \,=\, \left\{ \begin{array}{cl} e^{- \beta v_j} & {\rm ~if~} \sigma_j=1 \\ 1 & {\rm ~if~} \sigma_j=0 \end{array} \right.
  5. Generate a random number U. If U \le \alpha then X_{n+1}=\sigma'. Otherwise X_{n+1}=\sigma.

In short, if you can add a book to your knapsack you always do so, while you remove a book with a probability which is exponentially small in the weight.
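
A minimal sketch of this algorithm with made-up weights, values, and parameters (all the numbers below are illustrative assumptions, not data from the notes):

# Metropolis sampling of pi_beta(sigma) ~ exp(beta * v.sigma) on the knapsack
# state space {sigma : sigma.w <= b}; we also record the best packing seen.
import numpy as np

rng = np.random.default_rng(6)
w = np.array([2.0, 3.0, 4.0, 5.0, 7.0, 9.0])     # weights (illustrative)
v = np.array([3.0, 4.0, 6.0, 7.0, 10.0, 12.0])   # values  (illustrative)
b, beta, steps = 15.0, 2.0, 50000

sigma = np.zeros(len(w), dtype=int)              # start from the empty knapsack
best, best_value = sigma.copy(), 0.0
for _ in range(steps):
    j = rng.integers(len(w))
    proposal = sigma.copy()
    proposal[j] = 1 - proposal[j]
    if proposal @ w > b:                         # proposal leaves S': reject
        continue
    if rng.random() < min(1.0, np.exp(beta * v @ (proposal - sigma))):
        sigma = proposal
        if sigma @ v > best_value:
            best, best_value = sigma.copy(), float(sigma @ v)
print(best, best_value)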

Glauber algorithm

Another algorithm which is widely used for Markov chain Monte-Carlo is the Glauber algorithm, which appears in the literature under a variety of other names such as Gibbs sampler in statistical applications, logit rule in economics and social sciences, heat bath in physics, and undoubtedly under various other names.

The Glauber algorithm is not quite as general as the Metropolis algorithm. Indeed we assume that the state space S has the following structure S \subset \Omega^V where both \Omega and V are finite sets. For example S \subset \{0,1\}^m in the case of the knapsack problem or S \subset \{ 1, \cdots, q\}^V for the case of the proper q-coloring of a graph. We denote by \sigma \,=\, \{ \sigma(v) \}_{v \in V} \,, \quad \sigma(v) \in \Omega \,. the elements of S.

It is useful to introduce the notation \sigma_{-v} \,=\, \{ \sigma(w) \}_{w \in V, w\not =v} and we write \sigma \,=\, ( \sigma_{-v}, \sigma(v) ) \,. to single out the v entry of the vector \sigma.

Theorem 12.2 (Glauber algorithm) Let \pi be a probability distribution on S\subset \Omega^V and extend \pi to \Omega^V by setting \pi(\sigma)=0 if \sigma \in \Omega^V \setminus S. If X_n \,=\, \sigma then

  1. Choose v \in V at random.
  2. Replace \sigma(v) by a new value a \in \Omega (provided ( \sigma_{-v}, a) \in S) with probability \frac{ \pi( \sigma_{-v}, a) } {\sum_{ b \in \Omega} \pi( \sigma_{-v}, b)} \,. The Glauber algorithm defines a Markov chain on S which satisfies detailed balance with stationary distribution \pi.

The irreducibility of the algorithm is not guaranteed a-priori and needs to be checked on a case-by-case basis.

Proof. The transition probabilities are given by

P(\sigma, \sigma') = \left\{ \begin{array}{cl} \frac{1}{ |V|} \frac{ \pi( \sigma_{-v}, \sigma'(v)) } {\sum_{ b \in \Omega} \pi( \sigma_{-v}, b)} & \textrm{ if } \sigma_{-v} = \sigma'_{-v} {\rm~for~some~}v \\ 0 & \textrm{ if } \sigma_{-v} \not= \sigma'_{-v} {\rm~for~all~}v \\ 1 - \sum_{\sigma'} P(\sigma, \sigma') & \textrm{ if } \sigma=\sigma' \end{array} \right.

To check detailed balance we note that if \sigma \not= \sigma' and P(\sigma, \sigma')\not=0 then, with v the vertex where they differ, \pi(\sigma) P(\sigma, \sigma') \,=\, \frac{1}{|V|} \frac{\pi(\sigma) \pi(\sigma')} {\sum_{ b \in \Omega} \pi( \sigma_{-v}, b)} \,, and this is symmetric in \sigma and \sigma' since \sigma_{-v} = \sigma'_{-v}. \quad \blacksquare.

Ising model on a graph

Let G=(E,V) be a graph and let S= \{-1,1\}^V. That is, to each vertex we assign the value \pm 1; you can think of a magnet at each vertex pointing either upward (+1) or downward (-1). To each \sigma \in S we assign an “energy” H(\sigma) given by H(\sigma) \,=\, - \sum_{ e=(v,w) \in E} \sigma(v) \sigma(w) \,. The contribution of an edge (v,w) to the energy is minimal if \sigma(v) \sigma(w) =1 i.e., if the magnets at v and w are aligned. Let us consider the probability distribution \pi_\beta (\sigma) \,=\, \frac{e^{-\beta H(\sigma)} }{Z_\beta} \,, \quad Z_\beta \,=\, \sum_{\sigma} e^{-\beta H(\sigma)} \,. The distribution \pi_\beta is concentrated around the minima of H(\sigma). To describe the Glauber dynamics note that H ( \sigma_{-v}, 1) - H( \sigma_{-v}, -1) \,=\, -2 \sum_{ w\,;\, w \sim v} \sigma(w) and this can be computed simply by looking at the vertices connected to v and not at the whole graph. So the transition probabilities for the Glauber algorithm are given by picking a vertex at random and then updating with probabilities \frac{ \pi( \sigma_{-v}, \pm 1)} { \pi( \sigma_{-v}, 1) + \pi( \sigma_{-v}, -1) } \,=\, \frac{1}{ 1 + e^{\pm \beta \left[ H ( \sigma_{-v}, 1) - H( \sigma_{-v}, -1)\right]}} \,=\, \frac{1}{1 + e^{\mp 2 \beta \sum_{ w\,;\, w \sim v} \sigma(w)}} \,.

By comparison, for the Metropolis algorithm we pick a vertex v at random, propose to switch \sigma(v) to -\sigma(v), and accept the move with probability \min\left\{ 1, \frac{ \pi( \sigma_{-v}, - \sigma(v))}{\pi( \sigma_{-v}, \sigma(v))} \right\} \,=\, \min \left\{ 1, e^{ -2\beta \sigma(v) \sum_{ w\,;\, w \sim v} \sigma(w)} \right\} \,.
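
Here is a minimal sketch of the Glauber dynamics on a small example graph (a 5-cycle, chosen only for illustration), using the conditional probabilities just computed:

# Glauber dynamics for the Ising model on a 5-cycle at inverse temperature beta.
import numpy as np

rng = np.random.default_rng(7)
V, beta, steps = 5, 0.5, 100000
neighbors = {v: [(v - 1) % V, (v + 1) % V] for v in range(V)}

sigma = rng.choice([-1, 1], size=V)
mean_mag = 0.0
for _ in range(steps):
    v = rng.integers(V)
    s = sum(sigma[w] for w in neighbors[v])           # sum of neighboring spins
    # update sigma(v) to +1 with probability 1/(1 + exp(-2*beta*s)), as above
    sigma[v] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * beta * s)) else -1
    mean_mag += sigma.mean()
print(mean_mag / steps)    # average magnetization, close to 0 by symmetry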

Simulated annealing

  • Consider again the problem of finding the minimum of a function f(j), that is f^*= \min\{f(j), j \in S\} and we denote by S^*\subset S the location of the global minima. You should think of f as a complicated (and non-convex) function with complicated level sets and various “local” minima.
  • To perform this task we sample from a distribution of the form \pi_T(j) \propto e^{-\frac{f(j)}{T}} which concentrates on the minima of f(j), and the more so as T \to 0.

The idea of simulated annealing comes from physics. The concept of annealing in physics is that, to obtain a low energy state of a solid (typically a crystal), one first heats it up to reach a liquid state and then, slowly, decreases the temperature to let the particles arrange themselves.

For a Markov chain the idea is to pick a temperature schedule T_1 > T_2 > T_3 > \cdots \quad \textrm{ with } \lim_{k \to \infty} T_k =0 with T_1 sufficiently large.

The following algorithm is a Markov chain with nonstationary transition probabilities.

Simulated annealing algorithm:
+ Initialise the Markov chain X_0 and the temperature T_1.
+ For each k run N_k steps of the Metropolis or Gibbs sampler with invariant distribution \pi_{T_k}.
+ Update the temperature to T_{k+1} and continue, starting from the final configuration of the previous stage.

A nice result about the Metropolis sampler can be found for example in Hajek (1988) (where a more precise version of the following theorem is given).

Theorem 12.3 (Convergence of simulated annealing) For the simulated annealing of the Metropolis algorithm (N_k=1) there exists a constant d^* such that we have \lim_{n \to \infty} P\{X_n \in S^*\} = 1 if and only if \sum_{k=1}^\infty e^{-\frac{d^*}{T_k}} =\infty The constant d^* measures the depth of the local minima of f, where locality is measured in terms of the proposal matrix Q.

The theorem tells us that if we decrease the temperature very slowly, e.g. like T_k = \frac{c}{\log(k)} with c> d^*, then the Markov chain will converge, with probability 1, to a minimum.
This is an extremely slow cooling schedule, which makes it not very practical, and in practice other schedules are used, such as T_{k+1} = 0.99\, T_k.
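
A minimal simulated annealing sketch (entirely illustrative: the wiggly energy function, the geometric schedule and all parameters are made up): we minimize f on \{0,\dots,K\} with the nearest-neighbor Metropolis move, running N_k steps per temperature stage.

# Simulated annealing for a one-dimensional energy with many local minima,
# with geometric cooling T_{k+1} = 0.99 T_k and N_k = 50 steps per stage.
import numpy as np

rng = np.random.default_rng(8)
K = 100
f = lambda i: 0.02 * (i - 70)**2 + 3.0 * np.sin(i / 3.0)   # hypothetical energy

x, T, N_k = 0, 10.0, 50
for k in range(400):                     # 400 temperature stages
    for _ in range(N_k):
        y = x + (1 if rng.random() < 0.5 else -1)
        if 0 <= y <= K and rng.random() < min(1.0, np.exp((f(x) - f(y)) / T)):
            x = y
    T *= 0.99                            # cooling schedule
print(x, min(range(K + 1), key=f))       # state found vs. the true minimizer

With such a fast schedule the chain may occasionally end in a nearby local minimum rather than the global one, which is precisely the trade-off quantified by the theorem above.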

Parallel tempering

  • The idea of parallel tempering is not to use a temperature schedule but rather to use several copies of the system running at different temperatures. The copies at high temperature will explore the state space more efficiently and there is a switching mechanism which exchanges the different copies so that the lower temperature copies can take advantage of the exploration done at high temperature.
  • The state space is the product of K copies of S, S^{(K)} = \underbrace{S \times \cdots \times S}_{K \textrm{ times}} and we denote a state by the vector \mathbf{i} = (i_1, \cdots, i_K).

  • One picks a set of temperatures T_1 < T_2 < \cdots < T_K and a probability distribution \pi^{(K)} ({\bf i}) =\prod_{l=1}^K \pi_{T_l}(i_l)= \prod_{l=1}^K Z_{l}^{-1} e^{- \frac{f(i_l)}{T_l}}
    which is the product of the distribution at different temperatures.

Parallel tempering consists of two kinds of moves. The parallel move updates each component of \mathbf{X}_n independently with the Metropolis algorithm at the corresponding temperature, and the swapping move exchanges a pair of components of the state vector \mathbf{X}_n in such a way as not to disturb the invariant measure. The component at the lowest temperature can be used to find the desired minimum.

Parallel tempering algorithm:
+ Suppose the state is \mathbf{X}_n=\mathbf{i} and pick a random number U.
+ If U < \alpha then we do the parallel step and update each component of \mathbf{X}_n according to a Metropolis move at the corresponding temperature T_l.
+ If U \ge \alpha we do the swapping step. We randomly chose a neighboring pair l,l+1 and propose to swap the components X_{n,l} and X_{n,l+1}. We accept the swap with probability \min\left\{ 1, \frac{ \pi_{T_l}( i_{l+1}) \pi_{T_{l+1}}( i_l ) }{ \pi_{T_l}(i_l) \pi_{T_{l+1}}(i_{l+1})} \right\}

The parallel moves clearly satisfy detailed balance since each component does. As for a swap move which exchanges the components i_l and i_{l+1} in the state vector, we also have detailed balance since \begin{aligned} & \pi_{T_l}(i_l) \pi_{T_{l+1}}(i_{l+1}) (1-\alpha) \frac{1}{K-1} \min\left\{ 1, \frac{ \pi_{T_l}( i_{l+1}) \pi_{T_{l+1}}( i_l ) }{ \pi_{T_l}(i_l) \pi_{T_{l+1}}(i_{l+1})} \right\} \\ & = (1-\alpha) \frac{1}{K-1} \min \left\{ \pi_{T_l}(i_l) \pi_{T_{l+1}}(i_{l+1}) , \pi_{T_l}( i_{l+1}) \pi_{T_{l+1}}( i_l ) \right\} \end{aligned} which is symmetric in i_l,i_{l+1}.

13 Coupling

Total variation norm

  • Given two probability measures \mu and \nu on S the total variation distance between \mu and \nu is given by \|\mu - \nu\|_{\textrm{TV}}= \sup_{A \subset S}|\mu(A) - \nu(A)| that is, the largest possible difference between the probabilities assigned to a set.

  • The supremum over all sets is not convenient to compute but we have the formula

Theorem 13.1 We have the formula \begin{aligned} \|\mu - \nu\|_{\textrm{TV}} &= \sum_{i: \mu(i) \ge \nu(i)} (\mu(i) - \nu(i)) = \frac{1}{2} \sum_{i} | \mu(i) - \nu(i)| \end{aligned} \tag{13.1}

Proof. First note that the second equality in Equation 13.1 follows from the first. Indeed, if the first equality holds, then by interchanging \mu and \nu we also have
\|\mu - \nu\|_{\textrm{TV}} = \sum_{i: \mu(i) \ge \nu(i)} (\mu(i) - \nu(i)) = \sum_{i: \nu(i) \ge \mu(i)} (\nu(i) - \mu(i)) and adding the two expressions gives 2\|\mu - \nu\|_{\textrm{TV}} = \sum_{i} |\mu(i) - \nu(i)|, which proves the second equality.

To prove the first equality in Equation 13.1 we consider the set B=\{i:\mu(i) \ge \nu(i)\}. For any event A we have \begin{aligned} \mu(A) - \nu(A) = \sum_{i \in A } \mu(i)-\nu(i) \le \sum_{i \in A \cap B} \mu(i)-\nu(i) \le \sum_{i \in B} \mu(i)-\nu(i) =\mu(B)-\nu(B) \end{aligned} By interchanging the role of \mu and \nu we find \begin{aligned} \nu(A) - \mu(A) &\le \nu(B^c)-\mu(B^c) = \mu(B) -\nu(B) \end{aligned} and thus for any set A we have |\mu(A)-\nu(A)|\le \mu(B) - \nu(B). \quad\blacksquare

  • The total variation is also intimately related to the notion of coupling between probability measures. A coupling between the probability measures \mu and \nu is a probability measure q(i,j) on the product space S \times S such that \sum_{j} q(i,j) = \mu(i) \quad \textrm { and } \quad \sum_{i} q(i,j) = \nu(j) i.e. the marginals of q are \mu and \nu.

  • Couplings are nicely expressed in terms of random variables. We can think of q(i,j) as the (joint) p.d.f. of the random vector (X,Y) where X has p.d.f. \mu and Y has p.d.f. \nu.

  • There always exists a coupling since we can always choose X and Y independent, i.e. q(i,j)=\mu(i)\nu(j).

  • At the opposite extreme, if \mu=\nu then we can pick X=Y as a coupling, i.e. q(i,i)=\mu(i)=\nu(i) and q(i,j)=0 if i\not=j.

Total variation and coupling

Theorem 13.2 (Coupling representation of total variation) We have \|\mu -\nu\|_{\textrm{TV}}= \inf \left\{ P\{X \not =Y\}\,; (X,Y) \textrm{ coupling of } \mu \textrm { and } \nu \right\} \tag{13.2}

Proof. We first prove an inequality: for any coupling (X,Y) of \mu and \nu and any set A \begin{aligned} \mu(A)-\nu(A) &= P\{X \in A\} - P\{Y \in A\} \le P\{X \in A\}- P\{X \in A, Y \in A\} \\& = P\{X \in A, Y \notin A\} \le P\{ X \not=Y\} \end{aligned} and thus \|\mu-\nu\|_{\textrm{TV}} \le \inf P \{ X \not=Y\}.

To show the equality we construct an optimal coupling.

Recall from Theorem 13.1 that \displaystyle \|\mu-\nu\|_{\textrm{TV}}=\sum_{\mu(i)\ge \nu(i)}\mu(i)-\nu(i) = \sum_{\nu(i)\ge \mu(i)}\nu(i)-\mu(i) (see regions A and B in Figure 13.1) and we set p= \sum_{i} \mu(i) \wedge \nu(i) = 1 - \|\mu-\nu\|_{\textrm{TV}} (see region C in Figure 13.1).

Figure 13.1: schematics of the optimal coupling

Consider now the coupling defined as follows. Pick a random number U

  • If U \le p then let Z be the RV with pdf \gamma_C(i)=\frac{ \mu(i) \wedge \nu(i) }{p} and set X=Y=Z.
  • If U>p then let X be the random variable with pdf \gamma_A(i) and Y be the random variable with pdf \gamma_B(i) where \begin{aligned} \gamma_A(i) =\left\{ \begin{array}{cl} \frac{\mu(i)-\nu(i)}{\|\mu-\nu\|_{\textrm{TV}}} & \mu(i)\ge \nu(i) \\ 0 & \textrm{otherwise} \end{array} \right. \quad \gamma_B(i) =\left\{ \begin{array}{cl} \frac{\nu(i)-\mu(i)}{\|\mu-\nu\|_{\textrm{TV}}} & \nu(i)\ge \mu(i) \\ 0 & \textrm{otherwise} \end{array} \right. \end{aligned}

Since p \gamma_C + (1-p) \gamma_A = \mu and p \gamma_C + (1-p) \gamma_B = \nu this defines a coupling and we have P\{ X\not=Y\}=1-p =\|\mu-\nu\|_{\textrm{TV}}. \quad \blacksquare.
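
Here is a small numerical illustration (a sketch with two arbitrary distributions) of the theorem and of the optimal coupling constructed in the proof:

# Total variation distance and the optimal coupling for two small distributions.
import numpy as np

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])

tv = 0.5 * np.abs(mu - nu).sum()
p = np.minimum(mu, nu).sum()                 # mass of the diagonal part, p = 1 - TV
gamma_A = np.maximum(mu - nu, 0.0) / tv      # distribution of X when U > p
gamma_B = np.maximum(nu - mu, 0.0) / tv      # distribution of Y when U > p
q = np.diag(np.minimum(mu, nu)) + (1 - p) * np.outer(gamma_A, gamma_B)

print(tv, 1 - p)                             # equal, here both 0.3
print(q.sum(axis=1), q.sum(axis=0))          # marginals recover mu and nu
print(1 - np.trace(q))                       # P(X != Y) under q, again 0.3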

Coupling of Markov chains

  • A coupling of a Markov chain is a stochastic process (X_n,Y_n) with state space S\times S such that both X_n and Y_n are Markov chains with transition matrix P but with possibly different initial conditions.

  • A Markovian coupling is a coupling such that the joint (X_n,Y_n) is itself a Markov chain with some transition matrix Q((i,j), (k,l)) which must then satisfy \sum_{k} Q((i,j), (k,l)) = P(j,l) \quad \textrm{ and } \sum_{l} Q((i,j), (k,l)) = P(i,k) \,.

  • Recall Theorem 6.3, where we proved the convergence of aperiodic irreducible Markov chains to their stationary distribution using an independent coupling. We will expand upon this idea by considering other, more interesting, couplings.

  • The coupling time is defined by \sigma= \inf \{ n \ge 0 \,:\, X_n=Y_n \} that is, it is the first time the two chains are in the same state.

  • After the coupling time \sigma, X_n and Y_n have the same distribution (by the strong Markov property) so we can always modify a coupling so that, after time \sigma, X_n and Y_n move together, i.e. X_n=Y_n for all n \ge \sigma. We will always assume this to be true in what follows.

Speed of convergence via coupling

Theorem 13.3 Suppose (X_n,Y_n) is a coupling of a Markov chain such that X_0=i and Y_0=j and \sigma is the coupling time. Then we have \|P^n(i,\cdot) - P^n(j,\cdot)\|_{ \textrm{TV} } \le P \left\{ \sigma>n| X_0=i, Y_0=j\right\}

Proof. We have P\{X_n=l|X_0=i\}=P^n(i,l) and P\{Y_n=l|Y_0=j\}=P^n(j,l) and therefore (X_n, Y_n) is a coupling of the probability distributions P^n(i,\cdot) and P^n(j,\cdot). So by Theorem 13.2 we have \|P^n(i,\cdot) - P^n(j,\cdot)\|_{\textrm{TV}} \le P \left\{ X_n \not= Y_n | X_0=i,Y_0=j\right\} = P \left\{ \sigma > n | X_0=i,Y_0=j\right\}\,. \blacksquare

We can use this result to bound the distance to the stationary measure

Theorem 13.4 We have \sup_{i \in S } \| P^n(i,\cdot) - \pi\|_{\textrm{TV}} \le \sup_{i,j \in S } \|P^n(i,\cdot) - P^n(j,\cdot)\|_{\textrm{TV}}

Proof. Using the stationarity and the triangle inequality we have \begin{aligned} \| P^n(i,\cdot) - \pi\|_{\textrm{TV}} &= \sup_{A} |P^n(i,A) - \pi(A)| \\ & = \sup_{A}| \sum_{j} \pi(j) (P^n(i,A)-P^n(j,A))| \\ & \le \sum_{j} \pi(j) \sup_{A} |P^n(i,A)-P^n(j,A)| \\ & = \sum_{j} \pi(j) \|P^n(i,\cdot) - P^n(j,\cdot)\|_{TV} \\ & \le \sup_{i,j\in S} \|P^n(i,\cdot) - P^n(j,\cdot)\|_{TV} \quad \blacksquare \end{aligned}

  • We set d(n) = \sup_{i} \|P^n(i,\cdot) - \pi \|_{\textrm{TV}} which is the maximal distance to stationarity starting from an arbitrary initial state. (It is not hard to see that \|\mu P^n - \pi \|_{\textrm{TV}} \le d(n) for an arbitrary initial distribution \mu as well.)
    We then define the mixing time t_{\textrm{mix}}(\varepsilon) of a Markov chain to be t_{\textrm{mix}}(\varepsilon) = \min\{ n \,:\, d(n) \le \varepsilon \}\,. That is, if n \ge t_{\textrm{mix}}(\varepsilon) then \mu P^n is within \varepsilon of the stationary distribution in total variation.
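For a small finite chain d(n) can be computed directly from matrix powers; a minimal sketch (the 3-state transition matrix is a hypothetical example) is given below.

# Computing d(n) = max_i ||P^n(i,.) - pi||_TV and the mixing time (sketch)
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])        # hypothetical transition matrix

# stationary distribution: left eigenvector of P for the eigenvalue 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()

def d(n):
    Pn = np.linalg.matrix_power(P, n)
    return max(0.5 * np.sum(np.abs(Pn[i] - pi)) for i in range(len(pi)))

eps = 0.01
t_mix = next(n for n in range(1, 1000) if d(n) <= eps)
print(t_mix, d(t_mix))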

Mixing time for the success run chain

We start with a countable state space example, the success run chain which is very easy to analyze.

Parenthetically, it should be noted that the supremum over i in d(n) is often not well suited for countable state spaces: the number of steps it takes to get close to the stationary distribution may well depend on where you start.

We consider a special case of Example 8.4 with constant success probability: P(n,0)=1-p\,, \quad P(n,n+1)= p\,, \quad n=0,1,2,\cdots

Suppose X_0=i and Y_0=j (with i\not=j). We couple the two chains by moving them together:

  • Pick a random number U. If U < 1-p then set X_1=Y_1=0, and the coupling time is \sigma=1.

  • If U \ge 1-p then move both chains up: X_1=i+1, Y_1=j+1.

Clearly we have P( \sigma >n| X_0=i, Y_0=j) = p^n and thus we find for the mixing time
d(n) \le p^n \le \varepsilon \iff n \ge \frac{\ln \varepsilon}{\ln p} \quad \textrm{ so that } \quad t_{\textrm{mix}}(\varepsilon) \le \left\lceil \frac{\ln \varepsilon}{\ln p} \right\rceil
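A quick simulation sketch of this coupling (with an assumed value of p) confirms the geometric tail of the coupling time.

# Coupling time for the success run chain: P(sigma > n) = p^n (sketch)
import numpy as np

p = 0.7
def coupling_time():
    # at each step both chains jump to 0 together with probability 1-p
    sigma = 1
    while np.random.rand() >= 1 - p:   # otherwise both chains move up together
        sigma += 1
    return sigma

samples = np.array([coupling_time() for _ in range(100000)])
n = 5
print(np.mean(samples > n), p**n)      # empirical tail vs exact tail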

Mixing time for the random walk on the hypercube

Recall that for the random walk on the hypercube (see Example 5.4) a state is a string of d bits, \sigma=(\sigma_1, \cdots, \sigma_d) with \sigma_j\in \{0,1\}.

The Markov chain is periodic with period 2, so to make it aperiodic we consider its “lazy” version in which we do not move with probability 1/2, that is we consider instead the transition matrix \frac{P+I}{2}, which makes the chain aperiodic. The Markov chain then moves as follows: we pick a coordinate \sigma_j at random and replace it by a random bit 0 or 1.

To couple the Markov chains we simply move them together: if X_n=\sigma and Y_n=\sigma' we pick a coordinate j \in \{1,\cdots, d\} at random and then replace both \sigma_j and \sigma'_j by the same random bit. This is a coupling, and after a move the chosen coordinates coincide.

Under this coupling X_n and Y_n will get closer to each other if we select a j such that \sigma_j \not= \sigma'_j, and the chains will have coupled once every coordinate has been selected at least once. The distribution of the coupling time is thus exactly the same as the time \tau needed to collect d toys in the coupon collector problem.

We now get a bound on the tail as follows. We denote by A_i the event that coordinate i has not been selected in the first n steps. We have, using the inequality (1-x)\le e^{-x}, \begin{aligned} P\{\sigma > n\} = P\left(\bigcup_{i=1}^d A_i\right) \le \sum_{i=1}^d P(A_i) = \sum_{i=1}^d \left(1- \frac{1}{d}\right)^n \le d e^{-\frac{n}{d}} \end{aligned} So if we pick n = d\ln(d) + c d we find P\{\sigma > d\ln(d) + c d\} \le d e^{-\frac{d \ln(d) + cd}{d}}= e^{-c} \implies t_{\textrm{mix}}(\varepsilon) \le d\ln(d) + \ln(\varepsilon^{-1})d\,.
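The coupon-collector bound can be checked by simulating this coordinate-by-coordinate coupling directly; here is a small sketch for an assumed dimension d.

# Coupling time for the lazy random walk on the hypercube (sketch)
import numpy as np

d = 20
def coupling_time():
    X = np.zeros(d, dtype=int)   # one chain started from all zeros
    Y = np.ones(d, dtype=int)    # the other from all ones
    n = 0
    while not np.array_equal(X, Y):
        j = np.random.randint(d)       # pick a coordinate at random
        b = np.random.randint(2)       # replace it by the same random bit in both chains
        X[j], Y[j] = b, b
        n += 1
    return n

samples = np.array([coupling_time() for _ in range(2000)])
c = 2.0
print(np.mean(samples > d * np.log(d) + c * d), np.exp(-c))  # empirical tail vs the bound e^{-c}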

14 Poisson Process

Definition of the Poisson process

  • Counting processes N_t are continuous time stochastic processes which count the number of events occurring up to time t: N_t = \textrm{total number of events occurring up to time } t e.g. the number of accidents reported to an insurance company, the number of earthquakes in California, and so on.

  • The Poisson process is the simplest counting process, intuitively characterized by the following properties:

    1. The numbers of events occurring in disjoint time intervals are independent.
    2. The rate at which events occur is constant over time.
    3. The events occur one at a time.
  • More formally a Poisson process N_t with rate parameter \lambda is a continuous time stochastic process such that

    • Independent increments: Given times s_1\le t_1 \le s_2 \le t_2 \cdots \le s_n \le t_n the random variables N_{t_i}-N_{s_i} (that is the number of events occurring in the disjoint time intervals [s_i,t_i]) are independent.

    • The rate at which events occur is denoted by \lambda>0 and for any (small) time interval [t, t + \Delta t] we have
      \begin{aligned} P \{ N_{t + \Delta t } = N_t \} &= 1- \lambda \Delta t +o(\Delta t) & \\ P\{N_{t + \Delta t}= N_t +1\} &= \lambda \Delta t +o(\Delta t) & \textrm{ with } \lim_{\Delta t \to 0}\frac{o(\Delta t)}{\Delta t}=0 \\ P\{N_{t + \Delta t}\ge N_t +2\} & = o(\Delta t) & \end{aligned} \tag{14.1}

Distribution of the Poisson process

Theorem 14.1 (Distribution of the Poisson process) The Poisson process N_t with N_0=0 has the distribution P\{N_t=k\} = e^{-\lambda t}\frac{(\lambda t)^k}{k!} i.e. N_t has a Poisson distribution with parameter \lambda t. Moreover, for any s \ge 0, M_t=N_{t+s}-N_{s} is again a Poisson process with rate \lambda.

Proof (version 1 using Poisson limit). Pick a large number n and divide the interval [0,t] into n intervals of size \frac{t}{n}. Write
N_t = \sum_{j=1}^n\left( N_{j \frac{t}{n}}- N_{(j-1) \frac{t}{n}} \right) as a sum of n independent random variables. If n is large the probability that any of these random variables is at least 2 is small. Indeed, by a union bound, we have \begin{aligned} P\left\{ N_{j \frac{t}{n}}- N_{(j-1) \frac{t}{n}} \ge 2 \textrm { for some } j \right\} &\le \sum_{j=1}^n P\left\{ N_{j \frac{t}{n}}- N_{(j-1) \frac{t}{n}} \ge 2 \right\} \le n P \left\{ N_{\frac{t}{n}} \ge 2\right\} = t \frac{o( \frac{t}{n})}{\frac{t}{n}} \end{aligned} which goes to 0 as n \to \infty.

Therefore N_t is, approximately, a binomial random variable with success probability \lambda \frac{t}{n}: P(N_t=k) \approx {n \choose k} \left(\frac{\lambda t}{n} \right)^k \left(1-\frac{\lambda t}{n} \right)^{n-k} and as n \to \infty this converges to a Poisson distribution with parameter \lambda t. \quad \blacksquare

Proof (version 2 using ODEs). Let us derive a system of ODEs for the family P_t(k)=P\left\{ N_t=k \right\}. We have \frac{d}{dt}P_t(k) = \lim_{\Delta t \to 0} \frac{P\left\{ N_{t+\Delta t}=k \right\} - P\left\{ N_t=k \right\} }{\Delta t} Conditioning we find \begin{aligned} P\left\{ N_{t+\Delta t}=k \right\}=& P\left\{ N_{t+\Delta t}=k | N_t=k \right\} P\left\{ N_{t}=k \right\} \\ & + P\left\{ N_{t+\Delta t}=k | N_t=k-1 \right\}P\left\{ N_{t}={k-1} \right\} \\ & + P\left\{ N_{t+\Delta t}=k | N_t\le k-2\right\}P\left\{N_t \le k-2\right\} \\ =& P_t(k) (1 - \lambda \Delta t) + P_t(k-1) \lambda \Delta t + o(\Delta t) \end{aligned} and this gives the system of equations \begin{aligned} \frac{d}{dt}P_t(0)&= - \lambda P_t(0) & \quad P_0(0)=1 \\ \frac{d}{dt}P_t(k)&= \lambda P_t(k-1) - \lambda P_t(k) &\quad P_0(k)=0 \end{aligned}

For k=0 we find P_t(0)= e^{-\lambda t}. We use an integrating factor and set f_t(k)=e^{\lambda t} P_t(k). We have then f_t(0)=1 and for k >0 \frac{d}{dt}f_t(k)=\lambda f_t(k-1), \quad f_0(k)=0
which we can solve iteratively to find f_t(k) = \frac{(\lambda t)^k}{k!} and thus N_t has distribution P_t(k)= e^{-\lambda t}\frac{(\lambda t)^k}{k!}\,. \blacksquare.

Poisson process and exponential random variables

  • Poisson processes and exponential (and gamma) random variables are intimately related. Given a Poisson process we consider the interarrival times T_1, T_2, \cdots: T_1 is the time of occurrence of the first event, T_2 is the time elapsed between the first and second events, and so on.

Theorem 14.2 If N_t is a Poisson process with parameter \lambda then the interarrival times are independent exponential random variables with parameter \lambda.

Proof. If T_1 >t it means no event has occurred up to time t and so N_t=0. Therefore P\left\{ T_1 >t \right\} = P\left\{ N_t =0\right\} = e^{-\lambda t} and thus T_1 has an exponential distribution. For T_2 we condition on T_1: P\left\{ T_2 >t \right\} = \int P(T_2>t |T_1 =s) f_{T_1}(s) ds and, using the independence of the increments, P\left\{ T_2>t |T_1 =s\right\} = P\left\{ 0 \textrm{ events in } (s, s+t] \,|\, T_1=s \right\} = P\left\{ 0 \textrm{ events in } (s, s+t] \right\} = e^{-\lambda t} from which we conclude that T_2 has an exponential distribution and is independent of T_1. This argument can be repeated for T_3 by conditioning on the time of the second event, T_1+T_2, and so on. \quad \blacksquare

Another set of closely related quantities are the arrival times S_1, S_2, \cdots (S_n is the time of the n^{th} event), which are related to the interarrival times by S_1=T_1, \quad S_2=T_1+T_2, \quad S_3=T_1+T_2+T_3, \cdots By Theorem 14.2, S_n is the sum of n independent exponential RVs and thus S_n is a Gamma RV with parameters (n,\lambda) and density f_{S_n}(t) = \lambda e^{-\lambda t}\frac{(\lambda t)^{n-1}}{(n-1)!}\, \quad \textrm{ for }t \ge 0\,. We can actually prove this fact using the Poisson process by noting that, by definition,
N_t \ge n \iff S_n \le t \,, that is, n or more events have occurred by time t if and only if the n^{th} event occurred prior to or at time t.

So the CDF of S_n is F_{S_n}(t)=P\{ S_n \le t\} = P \{ N_t \ge n\} = \sum_{j=n}^\infty e^{-\lambda t} \frac{(\lambda t)^{j}}{j!} and upon differentiating we find f_{S_n}(t)= - \lambda e^{-\lambda t} \sum_{j=n}^\infty \frac{(\lambda t)^{j}}{j!} + \lambda e^{-\lambda t}\sum_{j=n}^\infty \frac{(\lambda t)^{j-1}}{(j-1)!} = \lambda e^{-\lambda t} \frac{(\lambda t)^{n-1}}{(n-1)!}. \quad \blacksquare

Poisson process and uniform distribution

Let us start with a special case and assume that N_t=1, that is exactly one event has occurred in [0,t]. Since the Poisson process has independent increments it seems reasonable that the event occurs with equal probability at any time in [0,t]. Indeed for s\le t we have, using the independence of increments, \begin{aligned} P\{T_1 \le s | N_t=1\} & = \frac{P\{T_1 \le s , N_t=1\}}{P\{N_t=1\}} = \frac{P\{\textrm{1 event in }(0,s] \textrm{ and no event in } (s,t] \}}{P\{N_t=1\}} \\ &= \frac{ \lambda s e^{-\lambda s} e^{-\lambda(t-s)} }{\lambda t e^{-\lambda t}} = \frac{s}{t} \end{aligned} and thus the density of the arrival time T_1 conditioned on N_t=1 is uniform on [0,t].

We study further the properties of the arrival times S_1< S_2 <\cdots of a Poisson process. The following result tells us that, conditioned on N_t=n, they are distributed like the order statistics of n uniform random variables on [0,t].

Theorem 14.3 Given the event \{N_t=n\}, the n arrival times S_1, \cdots, S_n have the same distribution as the order statistics of n independent random variables uniformly distributed on [0,t].

Proof. The conditional density of (S_1, \cdots, S_n) given that N_t=n can be obtained as follows. If S_1=s_1, S_2=s_2, \cdots, S_n=s_n and N_t=n then the interarrival times must satisfy T_1=s_1, T_2=s_2-s_1, \cdots, T_n= s_n-s_{n-1}, T_{n+1}> t-s_n.

By the independence of the interarrival times proved in Theorem 14.2 the conditional density is given by \begin{aligned} f(s_1, s_2, \cdots, s_n|n) &= \frac{f(s_1, s_2, \cdots, s_n, n)}{P\{N_t=n\}} \\ &= \frac{\lambda e^{-\lambda s_1} \lambda e^{-\lambda(s_2-s_1)} \cdots \lambda e^{-\lambda(s_n-s_{n-1})} e^{-\lambda(t-s_n)} }{e^{-\lambda t} \frac{(\lambda t)^n}{n!} } = \frac{n!}{t^n} \end{aligned} which is the joint density of the order statistics of n uniform random variables on [0,t]. \blacksquare

  • Recall that if X_1, \cdots, X_n are IID random variables with joint density f(x_1)\cdots f(x_n) and X^{(1)} \le X^{(2)} \le \cdots \le X^{(n)} are the order statistics, then the joint pdf of X^{(1)}, \cdots ,X^{(n)} is given by g(x_1,\cdots, x_n) = \left\{ \begin{array}{cl} n! f(x_1) \cdots f(x_n) &\textrm{ if } x_1 \le x_2 \le \cdots \le x_n \\ 0 & \textrm{ else } \end{array} \right.

Simulation of Poisson process

The characterization of the Poisson process in terms of exponential random variables suggests immediately a very simple algorithm to simulate N_t.

Simulate independent exponential RVs T_1, T_2, \cdots with parameter \lambda and set N_t=0 for 0\le t <T_1, N_t=1 for T_1\le t <T_1+T_2, and so on.

Figure 14.1: A sample of the Poisson process
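A minimal numpy sketch of this algorithm (with assumed values for \lambda and the time horizon) is:

# Simulating a Poisson process path from exponential interarrival times
import numpy as np

lam, t_max = 2.0, 10.0
arrival_times = []
t = np.random.exponential(1 / lam)        # T_1
while t <= t_max:
    arrival_times.append(t)
    t += np.random.exponential(1 / lam)   # add the next interarrival time

def N(t):
    # N_t = number of arrival times less than or equal to t
    return np.searchsorted(arrival_times, t, side='right')

print(len(arrival_times), N(5.0))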

Simulation of a Poisson random variable

  • It is not easy to simulate a Poisson random variable X directly from its pdf/cdf but we can do it elegantly using its relation with exponential random variables. To do this, generate independent exponential random variables until their sum exceeds 1 (so as to generate X=N_1) and use the relation between exponential and uniform random variables.
    \begin{aligned} X =n & \iff \sum_{k=1}^n T_k \le 1 < \sum_{k=1}^{n+1} T_k \iff \sum_{k=1}^n - \frac{1}{\lambda} \ln(U_k) \le 1 < \sum_{k=1}^{n+1} - \frac{1}{\lambda} \ln(U_k) \\ & \iff \ln( \prod_{k=1}^n U_k) \ge - \lambda > \ln( \prod_{k=1}^{n+1} U_k) \iff \prod_{k=1}^n U_k \ge e^{-\lambda} > \prod_{k=1}^{n+1} U_k \end{aligned}

  • Algorithm to simulate a Poisson random variable with parameter \lambda: Generate random numbers until their product is smaller than e^{-\lambda}.

    • Generate random numbers U_1, U_2, \cdots.
    • Set X=n if n+1 = \inf \left\{ j \,:\, \prod_{k=1}^{j} U_k < e^{-\lambda} \right\}
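A minimal sketch of this algorithm, together with a sanity check of the mean and variance, could look as follows.

# Simulating a Poisson(lambda) random variable via products of uniforms
import numpy as np

def poisson_sample(lam):
    threshold = np.exp(-lam)
    n, prod = 0, np.random.rand()      # prod = U_1
    while prod >= threshold:           # multiply until the product drops below e^{-lambda}
        prod *= np.random.rand()
        n += 1
    return n

lam = 3.0
samples = np.array([poisson_sample(lam) for _ in range(100000)])
print(samples.mean(), samples.var())   # both should be close to lambda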

Long-time behavior of the Poisson process

  • We investigate the behavior of N_t for large time. We prove a CLT type result, namely that
    \lim_{t \to \infty} \frac{N_t-\lambda t}{\sqrt{\lambda t}} = Z \textrm{ in distribution} where Z is a standard normal RV.

  • Recall that the characteristic function of a Poisson RV Y with parameter \mu is E[e^{i\alpha Y}]= e^ {\mu (e^{i\alpha} -1)}. Therefore E\left[ e^{i \alpha \frac{N_t - \lambda t}{\sqrt{\lambda t}}} \right] = e^{\lambda t \left( e^{i \frac{\alpha}{\sqrt{\lambda t}}} -i \frac{\alpha}{\sqrt{\lambda t}} -1\right)} Expanding the exponential we have \lambda t \left(e^{i \frac{\alpha}{\sqrt{\lambda t}}} -i \frac{\alpha}{\sqrt{\lambda t}} -1\right) = - \frac{\alpha^2}{2} + O( \frac{1}{\sqrt{\lambda t}}) and thus \lim_{t \to \infty} E\left[ e^{i \alpha \frac{N_t - \lambda t}{\sqrt{\lambda t}}} \right]=e^{-\frac{\alpha^2}{2}}.

  • The same computation shows also that, for any fixed t, \lim_{\lambda \to \infty} \frac{N_t-\lambda t}{\sqrt{\lambda t}} = Z \textrm{ in distribution} since rescaling the parameter is equivalent to rescaling time.

Sampling a Poisson process

  • We can sample or split a Poisson process. Suppose that every event of a Poisson process (independently of the other events) comes in one of two types, say type 1 with probability p and type 2 with probability q=1-p.

Theorem 14.4 Suppose N_t is a Poisson process with parameter \lambda and that every event (independently) is either of type 1 with probability p or of type 2 with probability q=1-p. Then N^{(1)}_t, the number of events of type 1 up to time t, and N^{(2)}_t, the number of events of type 2 up to time t, are independent Poisson processes with rates \lambda p and \lambda(1-p).

Proof. We check that N^{(1)}_t satisfies the definition of a Poisson process and then use Theorem 14.1.

N^{(1)}_0=0 and N^{(1)}_t has independent increments since N_t has independent increments and events are classified as type 1 or 2 independently of each other.

We have
\begin{aligned} P \left\{ N^{(1)}_{t + \Delta t}= N^{(1)}_t +1 \right\} & = P\{ N_{t + \Delta t}= N_t +1 \textrm{ and the event is of type } 1 \} \\ & + P\{ N_{t + \Delta t} \ge N_t +2 \textrm{ and exactly one event is of type } 1 \} \\ &= \lambda \Delta t \times p + o(\Delta t) \end{aligned}

\begin{aligned} P \left\{ N^{(1)}_{t + \Delta t}= N^{(1)}_t \right\} & = P\{ N_{t + \Delta t}= N_t \} + P\{ N_{t + \Delta t} = N_t + 1 \textrm{ and the event is of type 2} \} \\ & + P\{ N_{t + \Delta t} \ge N_t + 2 \textrm{ and no event of type $1$} \} \\ &= (1- \lambda \Delta t) + \lambda \Delta t (1-p) + o(\Delta t) =1-\lambda \Delta t p + o(\Delta t) \end{aligned} P \left\{ N^{(1)}_{t + \Delta t}\ge N^{(1)}_t +2 \right\} =o(\Delta t)

Finally to show that N^{(1)}_t and N^{(2)}_t are independent we compute their joint PDF by conditioning on the value of N_t and find \begin{aligned} P\left\{N^{(1)}_t =n , N^{(2)}_t =m \right\} &= P\left\{N^{(1)}_t =n , N^{(2)}_t =m | N_t={n+m}\right\} P\left\{N_t={n+m}\right\} \\ &= {n+m \choose n} p^n (1-p)^m e^{-\lambda t} \frac{(\lambda t)^{n+m}}{(n+m)!} \\ &= e^{-\lambda p t} \frac{ (\lambda p t )^n}{n!} e^{-\lambda (1-p) t} \frac{(\lambda(1-p) t)^{m}}{m!} \end{aligned} \blacksquare
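A quick simulation sketch of this splitting (with assumed values of \lambda, p and t) illustrates the two statements of the theorem: the means of the two counts and their (near-zero) correlation.

# Splitting a Poisson process into type-1 and type-2 events (sketch)
import numpy as np

lam, p, t = 5.0, 0.3, 10.0
n_samples = 100000
N = np.random.poisson(lam * t, size=n_samples)    # total number of events up to time t
N1 = np.random.binomial(N, p)                     # events marked type 1
N2 = N - N1                                       # events marked type 2

print(N1.mean(), lam * p * t)            # E[N1_t] = lambda p t
print(N2.mean(), lam * (1 - p) * t)      # E[N2_t] = lambda (1-p) t
print(np.corrcoef(N1, N2)[0, 1])         # close to 0: N1_t and N2_t are independent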

The coupon collecting problem

  • We revisit the coupon collector problem but we relax the assumption that all the toys are equally probable. We assume that any box contains toy i with probability p_i. How do we now compute the expected number of boxes needed to collect all M toys? The argument used earlier does not generalize easily.

  • We use the following trick of randomizing the times between boxes. Instead of collecting boxes at fixed time intervals, we collect them at times which are exponentially distributed with parameter 1. Then the number of boxes collected up to time t is a Poisson process N_t with rate \lambda=1 (on average it takes the same time to get a new box). We have now M types of events (getting a box with toy i) and we split the Poisson process accordingly. Then by Theorem 14.4 the number of toys of type i collected up to time t, N^{(i)}_t, is a Poisson process with rate \lambda p_i=p_i and the Poisson processes N^{(i)}_t are independent.

  • We now consider the times T^{(i)} = \textrm{ time of the first event for the process } N^{(i)}_t that is the time at which the first toy of type i is collected. The times T^{(i)} are independent since the underlying Poisson processes are independent, and are exponential with parameter p_i. Furthermore S = \max_{i} T^{(i)} = \textrm{ time until one toy of each type has been collected}. By independence we have P\left\{ S \le t \right\} = P\left\{ \max_{i} T^{(i)} \le t\right\} = \prod_{i=1}^M P\left\{ T^{(i)} \le t \right\} = \prod_{i=1}^M (1- e^{-p_it})

Thus E[S]= \int_0^\infty P\{S \ge t\} dt = \int_0^\infty ( 1- \prod_{i=1}^M (1- e^{-p_it}) )\, dt

  • Finally we relate S to the original question. If X is the number of boxes needed to collect all the toys then we have S = \sum_{k=1}^X S_k where the S_k are IID exponential with parameter 1. By conditioning on X it is easy to see that E[S]= E[X]E[S_1] = E[X] and we are done.
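A small numerical check of this formula (with a hypothetical vector of probabilities p_i, and using scipy for the integral) against a direct simulation of the coupon collector:

# Coupon collector with unequal probabilities: E[X] = E[S] via the Poisson trick (sketch)
import numpy as np
from scipy.integrate import quad

p = np.array([0.4, 0.3, 0.2, 0.1])     # hypothetical toy probabilities

# E[S] = integral over [0, infinity) of 1 - prod_i (1 - exp(-p_i t)) dt
integrand = lambda t: 1 - np.prod(1 - np.exp(-p * t))
expected, _ = quad(integrand, 0, np.inf)

def boxes_needed():
    # direct simulation of the number of boxes needed to collect every toy
    seen, n = set(), 0
    while len(seen) < len(p):
        seen.add(np.random.choice(len(p), p=p))
        n += 1
    return n

sim = np.mean([boxes_needed() for _ in range(20000)])
print(expected, sim)                   # the two estimates should agree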

Poisson process with variable rate

  • We can generalize the Poisson process by making the rate \lambda(t) at which events occur depend on time: a nonhomogeneous Poisson process N_t with rate parameter \lambda(t) is a continuous time stochastic process such that

  • Independent increments: Given times s_1\le t_1 \le s_2 \le t_2 \cdots \le s_n \le t_n the random variables N_{t_i}-N_{s_i} (that is the number of events occurring in the disjoint time intervals [s_i,t_i]) are independent.

  • We have \begin{aligned} P \{ N_{t + \Delta t } = N_t \} &= 1- \lambda(t) \Delta t +o(\Delta t) & \\ P\{N_{t + \Delta t}= N_t +1\} &= \lambda(t) \Delta t +o(\Delta t) & \textrm{ with } \lim_{\Delta t \to 0}\frac{o(\Delta t)}{\Delta t}=0 \\ P\{N_{t + \Delta t}\ge N_t +2\} & = o(\Delta t) & \end{aligned} \tag{14.2}

  • One way to construct a nonhomogeneous Poisson process is by sampling a homogeneous one in a time-dependent manner. Suppose \lambda(t) is bounded (locally in t); then we pick \lambda \ge \lambda(t). We consider a Poisson process M_t with constant rate \lambda, and if an event occurs at time t then we keep this event with probability p(t)=\frac{\lambda(t)}{\lambda} and discard it with probability 1-p(t). By the same argument we used in the section Sampling a Poisson process we see that the number of kept events satisfies the definition of a nonhomogeneous Poisson process in Equation 14.2.

  • Let us consider an event for the process M_t which occurred in the interval [0,t]. By our analysis of arrival times we know that this event occurred at a time which is uniformly distributed on the interval [0,t]. Therefore the probability that this event was accepted and contributes to N_t is p_t = \frac{1}{t}\int_0^t \frac{\lambda(s)}{\lambda} \, ds

By repeating the second part of the argument in the section Sampling a Poisson process we see that N_t has a Poisson distribution with parameter \lambda t p_t = \int_0^t \lambda(s) \, ds and in particular E[N_t] = \int_0^t \lambda(s) \, ds
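The thinning construction described above translates directly into code; a minimal sketch with a hypothetical bounded rate \lambda(t) is:

# Nonhomogeneous Poisson process by thinning a rate-lambda process (sketch)
import numpy as np

lam_max = 2.0
rate = lambda t: 1.0 + np.sin(t)**2      # hypothetical rate, bounded by lam_max

def sample_path(t_max):
    events, t = [], 0.0
    while True:
        t += np.random.exponential(1 / lam_max)    # candidate event from the rate-lam_max process
        if t > t_max:
            return np.array(events)
        if np.random.rand() < rate(t) / lam_max:   # keep with probability lambda(t)/lambda
            events.append(t)

t_max = 100.0
counts = [len(sample_path(t_max)) for _ in range(500)]
# E[N_t] = integral of lambda(s) ds, approximately 1.5 * t_max for this rate
print(np.mean(counts), 1.5 * t_max)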

Queueing model with infinitely many servers

  • Assume that the flow of customers entering an online store follows a Poisson process N_t with rate \lambda. The time S spent in the store by a single customer (browsing around, checking out, etc.) is given by its CDF G(t)=P\{ S \le t\} and we assume that the customers are independent of each other.

  • To figure out how to allocate resources one wants to know the number of customers, M_t, which are still in the system at time t.

  • To find the distribution of M_t let us consider one of the customers who arrived by time t. If he arrived at time s\le t then he will have left the system by time t with probability G(t-s) and will still be in the system at time t with probability 1-G(t-s). Since the arrival time of that customer is uniform on [0,t], by the splitting argument the distribution of M_t is Poisson with mean E[M_t] = \lambda \int_0^t (1- G(t-s))\,ds = \lambda \int_0^t (1 - G(s))\,ds\,.
    For large t, we see that E[M_t] \approx \lambda E[S].
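A simulation sketch of this infinite-server model (with assumed arrival rate, service distribution, and horizon) checks the approximation E[M_t] \approx \lambda E[S] for large t.

# Infinite-server queue: customers still in the system at time t (sketch)
import numpy as np

lam, t, mean_service = 3.0, 50.0, 2.0      # hypothetical parameters; S ~ Exp with mean 2

def customers_in_system():
    n_arrivals = np.random.poisson(lam * t)
    arrivals = np.random.uniform(0, t, size=n_arrivals)     # arrival times are uniform given N_t
    services = np.random.exponential(mean_service, size=n_arrivals)
    return np.sum(arrivals + services > t)   # customers whose stay extends past time t

samples = [customers_in_system() for _ in range(5000)]
print(np.mean(samples), lam * mean_service)  # for large t, E[M_t] ~ lambda E[S]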

Compound Poisson process

  • Suppose that the number of claims received by an insurance company follows a Poisson process. The size of each claim will be different and it is natural to assume that claims are independent of each other. If we look at the total claims incurred by the insurance company this leads to a stochastic process called a compound Poisson process.

  • A stochastic process X_t is called a compound Poisson process if it has the form X_t = \sum_{k=1}^{N_t} Y_k where N_t is a Poisson process and Y_1, Y_2,\cdots are IID random variables which are also independent of N_t.

  • The process X_t has stationary independent increments. Using that N_t-N_s is again Poisson distributed with parameter \lambda(t-s), X_t-X_s = \sum_{k=N_s+1}^{N_t} Y_k \textrm{ has the same distribution as } X_{t-s}= \sum_{k=1}^{N_{t-s}} Y_k

  • We can compute the MGF of X_t (or its characteristic function) by conditioning on N_t. Suppose m_Y(\alpha)=E[e^{\alpha Y}] is the moment generating function of Y; using the MGF of the Poisson RV we find \begin{aligned} m_{X_t}(\alpha)& =E\left[ e^{\alpha X_t}\right] = E\left[ e^{\alpha \sum_{k=1}^{N_t} Y_k} \right] = \sum_{n=0}^\infty E\left[ e^{\alpha \sum_{k=1}^{N_t} Y_k} | N_t =n \right] P\{N_t=n\} \\ & = \sum_{n=0}^\infty m_Y(\alpha)^n P\{N_t=n\} = e^{\lambda t ( m_Y(\alpha) -1 )} \end{aligned}

We can then compute the mean and variance: \begin{aligned} m_{X_t}'(\alpha)&= e^{\lambda t ( m_Y(\alpha) -1 )} \lambda t\, m_Y'(\alpha) \\ m_{X_t}''(\alpha)&= e^{\lambda t ( m_Y(\alpha) -1 )} \left(\lambda t\, m_Y''(\alpha) + (\lambda t)^2 m_Y'(\alpha)^2 \right) \end{aligned} and thus E[X_t]= \lambda t E[Y] \quad \textrm{ and } \quad Var[X_t]= \lambda t \left({\rm Var}(Y) + E[Y]^2\right) = \lambda t E[Y^2] With a bit more work we could prove a central limit theorem.
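A brief simulation sketch (with assumed exponential claim sizes) checking these mean and variance formulas:

# Compound Poisson process: checking E[X_t] and Var[X_t] by simulation (sketch)
import numpy as np

lam, t, mean_Y = 2.0, 10.0, 3.0          # hypothetical rate, horizon, and claim mean (Y exponential)
n_samples = 100000

N = np.random.poisson(lam * t, size=n_samples)
X = np.array([np.random.exponential(mean_Y, size=n).sum() for n in N])

print(X.mean(), lam * t * mean_Y)         # E[X_t] = lambda t E[Y]
print(X.var(), lam * t * 2 * mean_Y**2)   # Var[X_t] = lambda t E[Y^2], and E[Y^2] = 2 E[Y]^2 here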

15 Continuous time Markov chains

In this section we construct a continuous time Markov process X_t with t\ge 0. The Markov property can be expressed as P\{ X_t=j| \{X_{r}\}, 0\le r \le s \} = P\{X_t=j| X_s \} \,. for any 0 \le s < t.

Exponential random variables

To construct a Markov chain we will need lots of exponential random variables. Recall that an exponential random variable T with parameter \lambda has the pdf f_T(t)=\lambda e^{-\lambda t}, for t \ge 0, the cdf F_T(t) = 1- e^{-\lambda t}, and mean E[T]=\frac{1}{\lambda}.

Proposition 15.1 (Properties of exponential random variables) Let T_1, T_2, T_3, \cdots be independent exponential random variables with parameters \lambda_1, \lambda_2, \cdots. Then

  1. T= \min\{ T_1, \cdots, T_n\} is an exponential random variable with parameter \lambda_1+ \cdots +\lambda_n. Note that n=\infty is allowed if we assume that \sum_{n}\lambda_n is finite.
  2. \displaystyle P\{ T_i = \min \{ T_1, \cdots, T_n\} \} = \frac{\lambda_i}{\lambda_1+ \cdots \lambda_n}

Proof. For 1. we have, using independence, P\left\{ T > t \right\} = P\left\{ T_1 > t, \cdots, T_n > t \right\} = P\left\{ T_1 > t \right\} \cdots P\left\{ T_n > t \right\} = e^{-(\lambda_1+ \cdots +\lambda_n)t} and thus T is an exponential random variable.
For 2. we have, by conditioning, P\left\{ T_1 = T \right\} = \int_0^\infty P\{ T_2>t, \cdots, T_n > t \} f_{T_1}(t) \, dt = \int_0^\infty e^{-(\lambda_2 + \cdots+ \lambda_n)t} \lambda_1 e^{-\lambda_1 t}\, dt = \frac{\lambda_1}{\lambda_1+ \cdots \lambda_n} \blacksquare

Definition of a continuous time Markov chain

  • As for the Poisson process we will give two equivalent definitions of the process. The first one describes infinitesimal rates of change and leads to a system of ODEs describing the evolution of the pdf of X_t. The second definition uses exponential random variables and waiting times and leads to an algorithm to simulate a continuous time Markov chain, often called the stochastic simulation algorithm.
  • To define a Markov process on the state space S we assign a number \alpha(i,j) to any pair of states i\not=j. You should think of these numbers as \alpha(i,j) = \textrm{ rate at which the chain changes from state } i \textrm{ to state } j \,. We denote \alpha(i) = \sum_{j\not= i} \alpha(i,j) = \textrm{ rate at which the chain changes from state } i \,.
  • Formally, a continuous time Markov chain with rates \alpha(i,j) is a stochastic process such that \begin{aligned} P\{ X_{t+\Delta t}=i| X_t=i\} &= 1 - \alpha(i)\Delta t + o(\Delta t) \\ P\{ X_{t+\Delta t}=j| X_t=i\} &= \alpha(i,j)\Delta t + o(\Delta t) \end{aligned}

  • Proceeding as for the Poisson process we can derive a differential equation for p_t(i)=P\{X_t =i\}. By conditioning we have P\{ X_{t+\Delta t}=i\} = (1 - \alpha(i)\Delta t )P\{ X_t=i\} + \sum_{j \not =i} \alpha(j,i)\Delta t P\{ X_t=j\} + o(\Delta t) which leads to the system of linear ODE’s \frac{d}{dt}p_t(i) = -\alpha(i) p_t(i) + \sum_{j \not= i} \alpha(j,i) p_t(j) \tag{15.1} called the Kolmogorov forward equations.
  • The infinitesimal generator of a continuous-time Markov chain is given by the matrix A = \begin{matrix} 1 \\ 2 \\ 3 \\ \vdots \end{matrix} \begin{pmatrix} -\alpha(1) & \alpha(1,2) & \alpha(1,3) & \cdots \\ \alpha(2,1) & -\alpha(2) & \alpha(2,3) & \cdots\\ \alpha(3,1) & \alpha(3,2) & -\alpha(3) & \cdots\\ \vdots & \vdots & \vdots & \end{pmatrix} and the entries of A satisfy A(i,j)\ge 0 \textrm{ for } i \not= j \quad \textrm{ and } \quad \sum_{j} A(i,j) = 0 \textrm{ for every } i

  • If we use a row vector p_t = (p_t(1), p_t(2), \cdots) then we can rewrite Equation 15.1 as the system \frac{d}{dt}p_t = p_t A \tag{15.2}

  • If S is finite then one can write the solution in terms of the matrix exponential e^{tA}= \sum_{k=0}^\infty \frac{t^k A^k}{k!}: \quad p_t = p_0 e^{tA} \,, where p_0(i)=P\{X_0=i\} is the initial distribution.

  • We can also write equation for the transition probabilities (take X_0=i) P_t(i,j) = P \{ X_t =j| X_0=i\} and we obtain the matrix equation \frac{d}{dt}P_t = P_t A, \quad \textrm{ with } P_0 =I \quad \implies \quad P_t = e^{tA} which we can solve using linear algebra techniques

Example

A = \begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix} \begin{pmatrix} -1 & 1 & 0 & 0 \\ 1 & -3 & 1 & 1 \\ 0 & 1 & -2 & 1 \\ 0 & 1 & 1 & -2 \end{pmatrix} The eigenvalues of A are 0, -1, -3, -4 and we can diagonalize A and find D=\begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & -3 & 0 \\ 0 & 0 & 0 & -4 \\ \end{pmatrix} = Q^{-1} A Q where the columns of Q are the corresponding (right) eigenvectors: Q=\begin{pmatrix} 1& 1 & 0 & -\frac13 \\ 1& 0 & 0 & 1 \\ 1& -\frac12 & -\frac12 & -\frac13 \\ 1& -\frac12 & \frac12 & \frac13 \\ \end{pmatrix} \quad \quad Q^{-1}=\begin{pmatrix} \frac14 & \frac14 & \frac14 & \frac14 \\ \frac23& 0 & -\frac13 & -\frac13 \\ 0& 0 & -1 & -1 \\ -\frac14& \frac34 & -\frac14 & -\frac14 \\ \end{pmatrix} Then we find P_t = e^{tA} = Q e^{tD} Q^{-1} with e^{tD} = \mathrm{diag}(1, e^{-t}, e^{-3t}, e^{-4t}).
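Numerically one can bypass the diagonalization altogether; a minimal sketch using scipy's matrix exponential for this generator is below (the rows of P_t converge to the stationary distribution, which here is uniform since the columns of A also sum to 0).

# Computing the transition matrix P_t = exp(tA) for the 4-state example (sketch)
import numpy as np
from scipy.linalg import expm

A = np.array([[-1.,  1.,  0.,  0.],
              [ 1., -3.,  1.,  1.],
              [ 0.,  1., -2.,  1.],
              [ 0.,  1.,  1., -2.]])

for t in [0.1, 1.0, 10.0]:
    P_t = expm(t * A)
    print(t, P_t[0])           # first row of P_t: the distribution at time t started from state 1

print(expm(100 * A)[0])        # approximately (1/4, 1/4, 1/4, 1/4)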

Alternative description

We can also construct the Markov chain X_t using exponential random variables.

  • If X_t=i consider independent exponential random variables T(i,j) with parameters \alpha(i,j) for j \not =i. Think of those as exponential clocks: when the first of the clocks rings, say clock j, the Markov chain moves to state j.

  • Using Proposition 15.1 we see that the chain moves from the state i to j if the clock associated to the pair (i,j) rings first, and this occurs with probability Q(i,j) = \frac{\alpha(i,j)}{\sum_{j \not= i}\alpha(i,j)} = \frac{\alpha(i,j)}{\alpha(i)}

  • The matrix Q(i,j) is the transition matrix of a discrete time Markov chain Y_n which has the same jump probabilities as X_t but whose transitions occur at unit time intervals.

  • Once the chain has moved to a new state j, a new set of clocks with rates \alpha(j,k) is started and the process repeats.

  • By Proposition 15.1 we can express this slightly differently: if in state i, we use a single clock with rate \alpha(i) and, when this clock rings, we move to state j according to the discrete time transition matrix Q(i,j) (see the simulation sketch after this list).

  • The Markov property follows from the memoryless property of the exponential distribution: P(T > t +s| T > s) = P(T>t).

  • If X_t=i then by construction the position after the next jump after time t depends only on i and not on the states that the Markov chain visited before time t. Moreover if we consider the time of the last jump before time t, which occurred, say, at time s<t, then the memoryless property of the exponential random variable implies that the time of the next jump after time t does not depend on s at all. Putting these together implies the Markov property P\{X_{t+u} =j | X_{s}, 0 \le s \le t\} = P\{X_{t+u} =j | X_t \}

  • To connect this to the previous description we note that by conditioning on the first jump P_t(i,j) = P(X_t=j|X_0=i) = \delta(i,j) e^{-\alpha(i) t} + \int_0^t \alpha(i) e^{-\alpha(i)s} \sum_{k} {Q(i,k)} P_{t-s}(k,j) ds Iterating this equation and setting t=\Delta t we find
    \begin{aligned} P(X_{\Delta t}=j|X_0=i) &= \delta(i,j) e^{-\alpha(i) \Delta t} + \int_0^{\Delta t} \alpha(i) e^{-\alpha(i)s} \sum_{k} \frac{\alpha(i,k)}{\alpha(i)} \delta(k,j) e^{-\alpha(k)(\Delta t-s)} ds + \cdots \\ &= \delta(i,j) e^{-\alpha(i) \Delta t} + \alpha(i,j) \Delta t \times \underbrace{ e^{-\alpha(j) \Delta t} \frac{1}{\Delta t} \int_0^{\Delta t} e^{-(\alpha(i)-\alpha(j)) s} ds }_{\to 1 \textrm{ as }\Delta t \to 0} + \cdots \\ &= \delta(i,j) (1 - \Delta t \alpha(i)) + \alpha(i,j) \Delta t + o(\Delta t) \end{aligned} The higher order terms are negligible since the probability to have two jumps in the time interval [0,\Delta t] is o(\Delta t).
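Here is a minimal sketch of the exponential-clock construction just described (the stochastic simulation algorithm); the rate matrix alpha is a hypothetical example with the same off-diagonal rates as the 4-state generator above.

# Simulating a continuous time Markov chain from its rates (stochastic simulation algorithm, sketch)
import numpy as np

# rates alpha(i,j) for i != j; the diagonal is unused and alpha(i) is the row sum
alpha = np.array([[0., 1., 0., 0.],
                  [1., 0., 1., 1.],
                  [0., 1., 0., 1.],
                  [0., 1., 1., 0.]])

def simulate(i0, t_max):
    t, i, path = 0.0, i0, [(0.0, i0)]
    while True:
        rate_i = alpha[i].sum()                     # alpha(i)
        t += np.random.exponential(1 / rate_i)      # holding time ~ Exp(alpha(i))
        if t > t_max:
            return path
        i = np.random.choice(len(alpha), p=alpha[i] / rate_i)   # jump according to Q(i,.)
        path.append((t, i))

print(simulate(0, 5.0))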

Uniformizable chain

  • Consider a Markov chain Y_n with transition matrix Q and assume that Q(i,i)=0. We pick the rate \alpha(i)=1 for all states i. The times at which the Markov chain makes a transition are then sums of IID exponentials, that is, they are described by a Poisson process. In other terms we have X_t = Y_{N_t} where N_t is a Poisson process with rate 1.
  • In this case we can compute the transition matrix quite explicitly: \begin{aligned} P\{X_t=j|X_0=i\}& = P\{ Y_{N_t}=j| X_0=i\} \\ &= \sum_{n=0}^\infty P\{ Y_{N_t}=j, N_t=n| X_0=i\} \\ &= \sum_{n=0}^\infty P\{ Y_n=j| X_0=i\} P\{N_t=n\} \\ &= \sum_{n=0}^\infty \frac{e^{-\lambda t} (\lambda t)^n}{n!} Q^n(i,j) \end{aligned}

and the generator is given by A = (Q-I).

Stationary distributions and detailed balance

  • A probability vector \pi is a stationary distribution for the Markov chain with generator A if
    \pi P_t = \pi \quad \textrm{ for all } t \ge 0 which, by the Kolmogorov equation, is equivalent to \pi A = 0

  • In terms of the rates \alpha(i,j) we see that \pi is stationary if and only if \sum_{i \not= j} \pi(i) \alpha(i,j) = \pi(j)\alpha(j) = \sum_{i \not=j} \pi(j) \alpha(j,i) which we can interpret as a balance equation: \pi(i)\alpha(i,j) is the rate at which the chain, in stationarity, changes from i to j.

  • As for discrete time we say that a Markov chain satisfies detailed balance if \pi(i) \alpha(i,j) = \pi(j) \alpha(j,i) \textrm{ for all } i \not= j and detailed balance obviously implies stationarity.

  • The Markov chain with generator A is irreducible if for any pair of states i,j we can find states i_1, \cdots, i_{N-1} such that \alpha(i,i_1) \alpha(i_1, i_2) \cdots \alpha(i_{N-1},j) >0 If there exists a stationary distribution for an irreducible chain then \pi(i)>0 for all i. Indeed if \pi(i)>0 and \alpha(i,j)>0 then \pi A(j) =0 implies that \pi(j)\alpha(j) = \sum_{k}\pi(k)\alpha(k,j) \ge \pi(i)\alpha(i,j)>0 and thus \pi(j)>0.

  • The issue of periodicity cannot occur for continuous time Markov chains

Lemma 15.1 For an irreducible Markov chain with generator A, P_t(i,j)>0 for all i,j and all t>0.

Proof. Using the Markov property \begin{aligned} P\{X_t=j|X_0=i\} &\ge P \{ X_{t/N}=i_1, X_{2t/N}=i_2, \cdots X_{t}=j |X_0=i\} \\ &= P \{ X_{t/N}=i_1| X_0=i\} P\{ X_{2t/N}=i_2| X_{t/N}=i_1\} \cdots P\{ X_{t}=j |X_{t\frac{N-1}{N}}=i_{N-1}\} \end{aligned} and for example
P \{ X_{t/N}=i_1| X_0=i\} \ge \int_0^{t/N} \alpha(i) e^{-\alpha(i)s} Q(i,i_1) e^{-\alpha(i_1)(t/N-s)}\, ds>0 \blacksquare

Convergence to the stationary distribution

If the state space is finite and the chain is irreducible then we have convergence to equilibrium.

Theorem 15.1 If the state space S is finite and the Markov chain with generator A is irreducible then for any initial distribution \mu we have \lim_{t \to \infty}\mu P_t = \pi

Proof. We use a spectral argument and the result for discrete time Markov chains. Pick a number a strictly greater than all the rates \alpha(i) (this is possible since S is finite). Consider the matrix Q= \frac{1}{a}A + I. Then Q is a stochastic matrix and the Markov chain Y_n with transition matrix Q is irreducible and aperiodic (because Q(i,i)>0). Then from Theorem 6.3 we know (details in the exercises) that Q has a simple eigenvalue 1 and all other eigenvalues \lambda satisfy |\lambda|<1. Now Qf = \lambda f \iff Af =a(\lambda -1) f and so 0 is a simple eigenvalue for A and the other eigenvalues are of the form a(\textrm{Re}(\lambda) -1) + i a \textrm{ Im }(\lambda), and so their real part is strictly negative.

The vector 1=(1,1, \cdots,1 )^T is a right eigenvector for A and \pi is a left eigenvector for A. If we define the matrix \Pi as the matrix whose rows are all equal to \pi, we have then (P_t - \Pi) \begin{pmatrix} 1\\ \vdots \\1 \end{pmatrix}= 0 \textrm{ and } \pi(P_t - \Pi) =0\,. Moreover if f is a right eigenvector and g a left eigenvector of A for an eigenvalue \mu \not=0 then we have \pi (Af) = \mu \pi f = (\pi A) f =0 \quad \textrm{ and } (g A) 1 = g (A1) = \mu g 1 =0 and thus we must have \pi f=0 and g 1 =0. Therefore
(P_t - \Pi)f = P_t f = e^{\mu t} f \textrm{ and } g(P_t - \Pi) = gP_t = e^{\mu t} g. This implies that, apart from the eigenvalue 0, P_t - \Pi has the same eigenvalues e^{\mu t} as P_t, where every such \mu has strictly negative real part. Therefore P_t - \Pi converges to 0 as t \to \infty, i.e. \lim_{t \to \infty} P_t(i,j)= \pi(j). \blacksquare

Transient behavior

To study the transient behavior in continuous time we can use similar ideas as in discrete time.

  • Absorption probabilities: the absorption probabilities do not depend on the time spent in every state so they can be computed using the transition matrix Q(i,j) for the embedded chain Y_n and the formula in Section Absorption probabilities

  • Expected hitting times: For example we have the following result.

Theorem 15.2 Suppose X_t is a Markov chain with generator A and let \sigma(j) = \inf\{t \ge 0 \,:\, X_t=j\} be the first hitting time of state j. Let \tilde{A} be the matrix obtained by deleting the j^{th} row and j^{th} column from the generator A. Then we have E[\sigma(j)|X_0=i] = \sum_{l \not= j} B(i,l) \quad \textrm{ where } B= (-\tilde{A})^{-1} The matrix \tilde{A} has row sums which are non-positive and at least one row has a strictly negative sum.

Proof. By conditioning on the first jump which happens at time T we have E[\sigma(j)|X_0=i] = \underbrace{E[T|X_0=i]}_{\textrm{expected time until the first jump }} + \sum_{k\in S, k\not=j} P\{ X_T = k|X_0=i\} \underbrace{E[ \sigma(j)|X_0=k]}_{\textrm{expected hitting time} \atop \textrm{ from the state after the first jump}} If we set b(i)=E[\sigma(j)|X_0=i] (for i \not= j) we find the equation b(i)= \frac{1}{\alpha(i)} + \sum_{k \not= j} \frac{\alpha(i,k)}{\alpha(i)} b(k) \implies 1 = \alpha(i) b(i) - \sum_{k\not= j}\alpha(i,k) b(k) which reads, in matrix form, as {\bf 1} = -\tilde{A} b \implies b= (-\tilde{A})^{-1} {\bf 1}. To show that -\tilde{A} is invertible we consider the matrix Q= \frac{1}{a}\tilde{A} + I where a is chosen larger than all the rates \alpha(i). Then the entries of Q are non-negative and the row sums do not exceed one, with at least one strictly less than 1. By the results for discrete time Markov chains we know that I-Q= -\frac{1}{a}\tilde{A} is invertible. \blacksquare
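For the 4-state generator of the earlier example, the expected hitting times of a given state can be computed directly from this formula; a short sketch:

# Expected hitting times E[sigma(j) | X_0=i] as row sums of (-A_tilde)^{-1} (sketch)
import numpy as np

A = np.array([[-1.,  1.,  0.,  0.],
              [ 1., -3.,  1.,  1.],
              [ 0.,  1., -2.,  1.],
              [ 0.,  1.,  1., -2.]])

j = 3                                     # target state (0-indexed)
keep = [k for k in range(len(A)) if k != j]
A_tilde = A[np.ix_(keep, keep)]           # delete the j-th row and j-th column
b = np.linalg.solve(-A_tilde, np.ones(len(keep)))
print(dict(zip(keep, b)))                 # E[sigma(j) | X_0 = i] for each i != j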

Explosion

  • For a continuous time Markov chain X_t let us consider the times of the successive jumps S_1=T_1, S_2=T_1+T_2, S_3=T_1+T_2+T_3, \cdots Here the T_i are independent exponential but, in general, not identically distributed, with parameters \alpha_i which depend on the states being visited. We have then E[S_n]= \sum_{i=1}^n \frac{1}{\alpha_i}

  • Explosion: To see what can happen consider a Markov chain with rates \alpha(n,n+1)=(n+1)^2 and all other rates equal to 0. Then the Markov chain moves up by 1 at every jump, like a Poisson process but at an accelerated pace. We have then E[S_\infty]= \sum_{n=1}^\infty \frac{1}{n^2} < \infty so S_\infty < \infty with probability 1. So there are infinitely many jumps in finite time and X_t=+\infty after a finite time. This is called explosion.

  • This is an issue familiar from ODEs: the equation \frac{d}{dt} x_t = x_t^2 has solution x_t= \frac{x_0}{1-x_0 t} which blows up at time t=\frac{1}{x_0}.

  • It is not easy to determine whether an explosion really occurs. Indeed for no explosion to occur we must have, with probability 1, \sum_{n} \frac{1}{\alpha(Y_n)} = \infty where Y_n is the embedded chain.

  • A sufficient condition for non-explosion is a suitable upper bound on the rates, say \alpha(i) \le \bar\alpha for all i, which holds for finite state spaces, but this is by no means necessary.

Transience, recurrence, and positive recurrence.

  • Recurrence and transience for a continuous time Markov chain can be defined as for discrete time. An irreducible Markov chain X_t is recurrent if the chain starting from a state i returns to i with probability 1, and transient if X_t never returns to i with some non-zero probability.

  • Transience/recurrence has nothing to do with the actual time spent in every state and so we have immediately X_t \textrm { with rates } \alpha(i,j) \textrm{ is } \begin{array}{l} \textrm{transient} \\ \textrm{recurrent}\end{array} \iff Y_n \textrm{ with transition } Q(i,j)=\frac{\alpha(i,j)}{\alpha(i)} \textrm{ is } \begin{array}{l} \textrm{transient} \\ \textrm{recurrent}\end{array}

  • For positive recurrence we need to define the first return time to a state i. This is most easily done in terms of the return time for the embedded Markov chain Y_n, whose first return time is \tau(i). The time until X_t returns to i will be the sum of the times spent in each state: if S_n denotes the time of the n^{th} jump and \tau(i)=n after visiting the states j_0, j_1, \cdots j_{n-1}, j_n=i, then the return time is \sum_{k=0}^{n-1}(S_{k+1}- S_k) where S_{k+1}-S_k is exponential with parameter \alpha(j_k). The first return time to state i for the Markov chain X_t is thus given by \Sigma(i) = \sum_{k=0}^{\tau(i)-1}(S_{k+1}- S_k)

Stationary distribution for recurrent chains

Theorem 15.3 If the Markov chain with rates \alpha(i,j) is irreducible and recurrent, then there exists a unique solution (up to a multiplicative constant) \eta=(\eta(1), \eta(2), \cdots) to the equation \eta A=0 with 0 < \eta(i) < \infty \,. If it holds that \sum_{i}\eta(i) < \infty then \eta can be normalized to a stationary distribution and X_t is positive recurrent.

Proof. The equation \eta A(k)=0 can be written as \sum_{j \not= k} \eta(j) \alpha(j,k) = \alpha(k) \eta(k) \iff \sum_{j \not= k} \eta(j)\alpha(j) Q(j,k) = \eta(k)\alpha(k) That is, the row vector \mu with entries \mu(k)=\alpha(k) \eta(k) must satisfy \mu Q = \mu.

If X_t is recurrent then the embedded Markov chain Y_n with transition Q is recurrent and so by Theorem 9.1 there exists a solution to \mu Q=\mu and therefore we found a solution for \eta A=0.

Moreover, from Theorem 9.1 we have the representation \mu(j)=\alpha(j) \eta(j) = E\left[ \sum_{k=0}^{\tau(i)-1} {\bf 1}_{\{Y_k=j\}}\right] where i is some fixed but arbitrary reference state (this counts the number of visits to the state j between two consecutive visits to the reference state i).

If we denote by S_n the time of the n^{th} jump for X_t we have \begin{aligned} \eta(j) &= \sum_{k=0}^\infty E\left[ \frac{1}{\alpha(j)} {\bf 1}_{\{Y_k=j\}} {\bf 1}_{\{\tau(i) >k\}} \right] = \sum_{k=0}^\infty E\left[ (S_{k+1} -S_k) {\bf 1}_{\{Y_k=j\}} {\bf 1}_{\{\tau(i) >k\}} \right] = E\left[ \sum_{k=0}^{\tau(i)-1} (S_{k+1} -S_k) {\bf 1}_{\{Y_k=j\}} \right] \end{aligned} which is nothing but the time spent (by X_t) in the state j between successive visits to i.

If \sum_{j} \eta(j) <\infty then we have E[\Sigma(i)]=E\left[ \sum_{k=0}^{\tau(i)-1} (S_{k+1} -S_k)\right] <\infty which is the expected return time to state i. That is, the chain X_t is positive recurrent. \blacksquare

Ergodic theorem for positive recurrent Markov chains

We have the following theorem which is the exact counterpart of the discrete time case (and is proved very similarly so we will omit the proof).

Theorem 15.4 Suppose X_t is irreducible and positive recurrent. Then X_t has a unique stationary distribution \pi and, with probability 1, the fraction of time spent in state j converges to \pi(j): \lim_{t\to \infty}\frac{1}{t}\int_0^t \mathbf{1}_{\{X_s=j\}} \, ds = \pi(j). Moreover we have Kac’s formula relating \pi(j) to the expected time \Sigma(j) between consecutive visits to state j: \pi(j) = \frac{1}{\alpha(j)\, E[\Sigma(j)|X_0=j]}. Conversely if X_t has a stationary distribution then X_t is positive recurrent.

Birth and death process

  • A general birth and death process is a continuous time Markov chain with state space \{0,1,2,\cdots\} and whose only non-zero transition rates are \begin{aligned} &\lambda(n) = \alpha(n,n+1) & = \textrm{ birth rate for a population of size } n & (\textrm{ for } n\ge 0) \\ &\mu(n)=\alpha(n,n-1) & = \textrm{ death rate for a population of size } n & (\textrm{ for } n\ge 1) \end{aligned}

  • The Kolmogorov equations for the distribution of X_t are, for n\ge 1, \frac{d}{dt}p_t(n) = \underbrace{ \mu(n+1) p_t(n+1)}_{\textrm{ increase due to death} \atop \textrm{in a population of size } n+1} + \underbrace{\lambda(n-1) p_t(n-1)}_{\textrm{ increase due to birth} \atop \textrm{in a population of size } n-1} - \underbrace{(\lambda(n) + \mu(n))p_t(n)}_{\textrm{ decrease due to birth/death} \atop \textrm{in a population of size } n}. For n=0, the equation reads \displaystyle \frac{d}{dt}p_t(0)= \mu(1) p_t(1) - \lambda(0) p_t(0).

  • The generator has the form \begin{matrix} 0 \\ 1 \\ 2\\ \vdots \end{matrix} \begin{pmatrix} -\lambda(0) & \lambda(0) & 0 & 0 & \dots \\ \mu(1) &-\lambda(1) -\mu(1) & \lambda(1) & 0 & \dots \\ 0 & \mu(2) &-\lambda(2) -\mu(2) & \lambda(2) & \dots \\ \vdots & \ddots & \ddots & \ddots & \\ \end{pmatrix}

  • The transition matrix for the embedded process Y_n is

\begin{matrix} 0 \\ 1 \\ 2\\ \vdots \end{matrix} \begin{pmatrix} 0 & 1 & 0 & 0 & \dots \\ \frac{\mu(1)}{\mu(1)+ \lambda(1)} & 0 & \frac{\lambda(1)}{\mu(1)+ \lambda(1)} & 0 & \dots \\ 0 & \frac{\mu(2)}{\mu(2)+ \lambda(2)} & 0 & \frac{\lambda(2)}{\mu(2)+ \lambda(2)} & \dots \\ \vdots & \ddots & \ddots & \ddots & \\ \end{pmatrix}
