Probability Theory

Probability: the study of regular patterns that arise from random phenomena.

Statistics: The inverse problem of making inferences from observations of patterns in random phenomena.

1 Preliminary: Measure Theory

Measures are defined to characterize the "mass" of a set, which is crucial for defining the Lebesgue integral, probability measures, etc. In order to define measures, we first need to assume some structure on the collection of sets, i.e., a σ-algebra.

1.1 σ -Algebra

Definition 1.1 ( σ -Algebra).

A σ-algebra on $X$ is a family $\mathcal{F}$ of subsets of $X$ such that

  • $\emptyset, X \in \mathcal{F}$,

  • if $A \in \mathcal{F}$, then $A^C \in \mathcal{F}$,

  • if $A_i \in \mathcal{F}$ for all $i$ in a countable index set $I$, then $\bigcup_{i \in I} A_i \in \mathcal{F}$ and $\bigcap_{i \in I} A_i \in \mathcal{F}$.

Definition 1.2 (Borel σ -Algebra).

The Borel σ-algebra $\mathcal{B}(X)$ is the σ-algebra generated by the open sets of $X$:

$\mathcal{B}(X) = \sigma(\mathcal{T}_X)$ (1)

where $(X, \mathcal{T}_X)$ is a topological space.

1.2 Measures

Definition 1.3 (Measures).

A measure on a measurable space $(X, \mathcal{F})$ is a map $\mu : \mathcal{F} \to [0, \infty]$ such that

  • $\mu(\emptyset) = 0$,

  • $\mu(\bigcup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} \mu(A_i)$ for pairwise disjoint sets $\{A_i\} \subseteq \mathcal{F}$.

The triple $(X, \mathcal{F}, \mu)$ is then called a measure space.

Definition 1.4 (Probability measure).

A probability measure $\mu$ on $X$ is a measure with $\mu(X) = 1$.

Other measures:

  • Dirac measure:

    $\delta_t(A) := \begin{cases} 1, & t \in A \\ 0, & t \notin A \end{cases}$ (2)
  • Counting measure:

    $\#(A) = \#\{\text{elements in } A\}$ (3)

1.3 Some related definitions

Finite measure: if $\mu(X) < \infty$, then $\mu$ is finite.

σ-finite measure: if we can cover $X$ by countably many measurable sets, each with finite measure, we call $\mu$ a σ-finite measure.

A straightforward claim: every finite measure is σ-finite. "Length" on $\mathbb{R}$ is a σ-finite measure but not finite.

Negligible: if a measurable set $A$ has $\mu(A) = 0$, then $A$ is a negligible (null) set for $\mu$.

Almost everywhere: if a property holds on a measurable set $A$ with $\mu(A^C) = 0$, we say it holds $\mu$-almost everywhere, written $\mu$-a.e.

1.4 Lebesgue measure

Borel measure: a measure $\mathcal{B}(\mathbb{R}) \to [0, \infty]$.

Lebesgue measure is defined on the Borel σ-algebra. For example, on $\mathbb{R}$:

$\lambda(A) := \inf \left\{ \sum_{i=1}^{\infty} |b_i - a_i| : A \subseteq \bigcup_{i=1}^{\infty} (a_i, b_i) \right\}, \quad A \in \mathcal{B}(\mathbb{R}).$ (4)

Intuition: find the most economical countable union of intervals covering $A$. (Note: a union of open intervals in $\mathbb{R}$ can always be written as a countable union of intervals.)

  • It is a measure.

  • $\lambda((a, b)) = |b - a|$ for $a < b$.

  • Translation invariance: $\lambda(A + t) = \lambda(A)$ for all $A \in \mathcal{B}(\mathbb{R})$ and $t \in \mathbb{R}$.

Lebesgue measure is the only Borel measure that satisfies the above properties.

Another example of a Borel measure: the measure induced by a cumulative distribution function (a Lebesgue–Stieltjes measure), which is finite and monotone.

1.5 Integration

Riemann integral. Let $f : [a, b] \to \mathbb{R}$. We say $f$ is Riemann integrable if the following limit exists:

$\lim_{\max \Delta x_k \to 0} \sum_{k=1}^{n} f(x_k^*)\, \Delta x_k,$ (5)

where $x_k^*$ is an arbitrary point in the $k$-th subinterval of width $\Delta x_k$.

However, the Riemann integral may not exist. For example, take $f(x) = 1$ for rational $x \in [0, 1]$ and $f(x) = 0$ otherwise. No matter how small an interval $\Delta x_k$ is, it contains both rational and irrational numbers, so the Riemann sums do not converge.

Definition 1.5 (Decreasing Rearrangement).

For a function $f : \mathbb{R} \to \mathbb{R}_+$, the decreasing rearrangement $h : \mathbb{R}_+ \to \mathbb{R}_+$ is defined as follows:

$h(t) := \lambda\{x \in \mathbb{R} : f(x) > t\} \quad \text{for } t \geq 0.$ (6)

$h(t)$ is a decreasing function, since the measure $\lambda$ is monotone.

Remark. Given the definition of $h$, we expect to define integration by:

$\int f(x)\, \lambda(dx) = \int_0^{\infty} \lambda\{x : f(x) > t\}\, dt = \int_0^{\infty} h(t)\, dt.$ (7)

The intuition of this definition: instead of measuring the area along the $x$-axis, we measure it along the $y$-axis.

In order for this integral to make sense, we need $\{x : f(x) > t\}$ to be measurable. The definition of measurable function below solves this problem.

Definition 1.6 ((Borel) Measurable function).

Let $(X, \mathcal{F})$ be a measurable space. A function $f : X \to \mathbb{R}$ is Borel-measurable if

$f^{-1}((t, \infty)) := \{x \in X : f(x) > t\} \in \mathcal{F} \quad \text{for all } t \in \mathbb{R}.$ (8)
Proposition 1.7.

A function $f : X \to \mathbb{R}$ is Borel measurable if and only if

$f^{-1}(B) \in \mathcal{F},$ (9)

for every Borel set $B \in \mathcal{B}(\mathbb{R})$.

Properties of measurable functions:

  • If $f, g$ are measurable, then $f \circ g$ is measurable (for $X = \mathbb{R}$),

  • $f + g$ is measurable,

  • $f_+(x) := \max\{f(x), 0\}$ and $f_-(x) := \max\{-f(x), 0\}$ are measurable,

  • $|f|$ is measurable,

  • $f \cdot g$ is measurable,

  • if $f_i : X \to \mathbb{R} \cup \{-\infty, \infty\}$ are measurable functions, then $\sup_i f_i$ and $\inf_i f_i$ are measurable.

Given the rigorous definition above, we can finally define Lebesgue integration. We first define it on positive functions and use that as a tool to define Lebesgue-integrable functions in general.

Definition 1.8 (Lebesgue integral of positive function).

Let $\mu$ be a Borel measure and let $f : \mathbb{R} \to \mathbb{R}_+ \cup \{\infty\}$ be a positive, measurable function. Then we define the Lebesgue integral of $f$ with respect to $\mu$ by

$\int f(x)\, \mu(dx) = \int_0^{\infty} \mu\{x : f(x) > t\}\, dt.$ (10)
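To make the layer-cake formula (10) concrete, here is a minimal numerical sketch (ours, not part of the notes): for $f(x) = e^{-x}$ and $\mu = \lambda$, the level sets satisfy $\lambda\{x : e^{-x} > t\} = -\log t$ on $(0, 1)$, and both sides of (10) evaluate to 1.

```python
import numpy as np

# f(x) = exp(-x) on [0, inf): the level sets satisfy
# lambda{x : exp(-x) > t} = -log(t) for t in (0, 1).
t = np.linspace(1e-6, 1.0, 200_000)
layer_cake = np.sum(-np.log(t)) * (t[1] - t[0])   # integral over the y-axis

x = np.linspace(0.0, 50.0, 200_000)
direct = np.sum(np.exp(-x)) * (x[1] - x[0])       # ordinary integral over x

print(layer_cake, direct)   # both approximately 1
```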
Definition 1.9 (Lebesgue integrable function).

A finite-valued measurable function $f : \mathbb{R} \to \mathbb{R}$ is integrable with respect to a Borel measure $\mu$ if

$\int |f(x)|\, \mu(dx) < \infty.$ (11)

Its Lebesgue integral is defined as

$\int f(x)\, \mu(dx) := \int f_+(x)\, \mu(dx) - \int f_-(x)\, \mu(dx),$ (12)

denoted by $\mu(f)$.

Proposition 1.10.

If a bounded function on a bounded domain $f : [a, b] \to \mathbb{R}$ is Riemann integrable, then $f$ is Lebesgue integrable with respect to the Lebesgue measure $\lambda$, and

$\int_{[a,b]} f(x)\, \lambda(dx) = \int_a^b f(x)\, dx.$ (13)
Proposition 1.11 (Monotone Convergence).

Let $(f_j : X \to \mathbb{R}_+ \cup \{\infty\})_{j \in \mathbb{N}}$ be positive, measurable functions with $f_j(x)$ increasing in $j$ and $\lim_{j \to \infty} f_j(x) = f(x)$ for all $x \in X$. Then

$\lim_{j \to \infty} \mu(f_j) = \mu(f).$ (14)

2 Mathematical Foundation of Probability

Kolmogorov developed an axiomatic foundation for the theory of probability in 1933.

Definition 2.1 (Probability Space, ANK 1933).

A triple $(\Omega, \mathcal{F}, \mathbb{P})$, where

  • $\Omega$ is a set called the sample space,

  • $\mathcal{F}$ is a σ-algebra on $\Omega$,

  • $\mathbb{P}$ is a probability measure on $(\Omega, \mathcal{F})$ assigning probabilities to events.

Notes:

  • $\omega \in \Omega$ are called sample points; in simple cases they are the outcomes of a probability experiment.

  • $\mathcal{F}$ identifies the sets of sample points that are assigned probabilities. The sets $E \in \mathcal{F}$ are called events.

  • When the outcome $\omega_0 \in E$, we say the event $E$ occurs.

  • If $\Omega$ is finite or countable, then we can take $\mathcal{F} = \mathcal{P}(\Omega)$, the power set. However, for $\mathbb{R}^n$ the power set is too big to be a reasonable σ-algebra, so we take the Borel σ-algebra $\mathcal{B}(\mathbb{R}^n)$.

States in Ω give a complete system description, but it’s often too detailed and inaccessible. Instead, we are more interested in observables, or system statistics, which are called random variables in probability theory.

Definition 2.2 (Real Random Variable).

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. A real random variable is a measurable function

$X : \Omega \to \mathbb{R}.$ (15)

Remarks.

$X$ is a fixed function, so where does the "randomness" come from? In fact, all the randomness comes from $X(\omega)$, where $\omega$ is a random sample point in the probability space.

The distribution of probability over events induces a distribution of probability over the values in $\mathbb{R}$ of the random variable.

The term "measurable function" used above is defined as:

Definition 2.3 (Measurable function).

A measurable function $X : \Omega \to \mathbb{R}$ has the property that $X^{-1}(B) \in \mathcal{F}$ for all $B \in \mathcal{B}(\mathbb{R})$.

Based on this property, we can compute the probability that $X$ takes a value in $B \in \mathcal{B}(\mathbb{R})$:

$\mathbb{P}(X \in B) := \mathbb{P}\{\omega \in \Omega : X(\omega) \in B\} = \mathbb{P}(X^{-1}(B)).$ (16)
Definition 2.4 (Law of a real random variable).

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $X : \Omega \to \mathbb{R}$ a real random variable.

The law (or distribution) of $X$ is the Borel probability measure $\mu_X$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that

$\mu_X(B) = \mathbb{P}\{X \in B\}, \quad B \in \mathcal{B}(\mathbb{R}).$ (17)

Notes. The probability that $X \in B$ can be expressed as

$\mathbb{P}(X \in B) = \int \mathbb{1}_B(x)\, \mu_X(dx).$ (18)
Definition 2.5 (Distribution function).

Let $X$ be a real random variable on a probability space with law $\mu_X$. Define

$F_X(a) := \mathbb{P}\{X \leq a\} = \mu_X((-\infty, a]), \quad a \in \mathbb{R}.$ (19)

This is called the cumulative distribution function (CDF) of $X$.

Note: it is important that the definition uses $\leq$ and not $<$.

Properties of CDF:

  • Monotonicity

  • Asymptotics: $F_X(a) \to 0$ as $a \to -\infty$ and $F_X(a) \to 1$ as $a \to \infty$

  • Right-continuity

  • Law: $\mu_X((a, b]) = F_X(b) - F_X(a)$.

Continuous random variables: the law has a density with respect to the Lebesgue measure $\lambda$:

$\mu_X(B) = \int_B f_X(x)\, \lambda(dx),$ (20)

where $f_X(x)$ is the probability density function. In particular, $F_X$ is absolutely continuous.

Singular continuous: $F_X$ is continuous but $\mu_X$ has no density function.
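As a sanity check on (20), the following sketch (ours, not part of the notes) verifies numerically that $\mu_X(B) = \int_B f_X\, d\lambda$ for a standard normal $X$ and $B = (a, b]$, using `scipy.stats.norm`.

```python
import numpy as np
from scipy.stats import norm

a, b = -0.5, 1.3
law = norm.cdf(b) - norm.cdf(a)          # mu_X((a, b]) via the CDF

x = np.linspace(a, b, 100_001)           # integral of the density over B
density_integral = np.sum(norm.pdf(x)) * (x[1] - x[0])

print(law, density_integral)             # agree to ~1e-5
```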

3 Expectation

3.1 Definitions

Definition 3.1 (Expectation).

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $X : \Omega \to \mathbb{R}$ a real random variable.

The expectation $\mathbb{E}[X]$ is the real number given by the integral

$\mathbb{E}[X] := \int_{\Omega} X(\omega)\, \mathbb{P}(d\omega) = \int_{\Omega} X\, d\mathbb{P}.$ (21)

More formally, we define it using the Lebesgue integral:

$\mathbb{E}[X] := \int_0^{\infty} \mathbb{P}\{X > t\}\, \lambda(dt), \quad \text{when } X \geq 0.$ (22)

For real $X$ (not necessarily positive), if $\mathbb{E}[|X|] < \infty$, we define $\mathbb{E}[X] := \mathbb{E}[X_+] - \mathbb{E}[X_-]$.

Definition 3.2 (Integrable random variable.).

For a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, we define

$L_1 := L_1(\Omega, \mathcal{F}, \mathbb{P}) = \{X : \Omega \to \mathbb{R} : \mathbb{E}[|X|] < \infty\}.$ (23)

Note: Not all random variables are integrable, e.g., Cauchy random variables.

3.2 Change of variables

How do we compute the expectation using the law? We have the following change-of-variables proposition.

Proposition 3.3 (Law of the unconscious statistician, LOTUS).

Let $X$ be a real random variable with law $\mu_X$, and let $h : \mathbb{R} \to \mathbb{R}$ be a $\mu_X$-integrable function. Then

$\mathbb{E}[h(X)] = \int h(x)\, \mu_X(dx).$ (24)

Example. If $X$ is continuous with a density $f_X$,

$\mathbb{E}[X] = \int x\, \mu_X(dx) = \int x f_X(x)\, \lambda(dx).$ (25)

Expectation is linear: $\mathbb{E}[\alpha X_1 + X_2] = \alpha\, \mathbb{E}[X_1] + \mathbb{E}[X_2]$.
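A quick Monte Carlo illustration of LOTUS (a sketch of ours, not from the notes): for $X \sim \mathcal{N}(0, 1)$ and $h(x) = x^2$, the sample average of $h(X)$ should match $\int h(x) f_X(x)\, dx = 1$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.standard_normal(1_000_000)

mc = np.mean(X**2)                                   # E[h(X)] by Monte Carlo

x = np.linspace(-8, 8, 100_001)                      # same moment against the law
quad = np.sum(x**2 * norm.pdf(x)) * (x[1] - x[0])

print(mc, quad)   # both approximately 1
```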

3.3 Convexity

Recall the (informal) definition of convex functions on $\mathbb{R}$: for all $\lambda \in [0, 1]$,

$f(\lambda x_1 + (1-\lambda) x_2) \leq \lambda f(x_1) + (1-\lambda) f(x_2), \quad \forall x_1, x_2 \in \mathrm{dom}(f), \ \mathrm{dom}(f) \text{ convex}.$ (26)
Proposition 3.4 (Subgradient property).

Suppose $f : \mathbb{R}^d \to \mathbb{R}$ is a convex function. Then for every $x \in \mathbb{R}^d$ there exists a subgradient vector $g_x \in \mathbb{R}^d$ such that for all $y \in \mathbb{R}^d$,

$f(y) \geq f(x) + \langle y - x, g_x \rangle.$ (27)
Theorem 3.5 (Jensen’s inequality).

Let $f : \mathbb{R} \to \mathbb{R}$ be convex and bounded below. Let $X$ be an integrable random variable. Then

$\mathbb{E}[f(X)] \geq f(\mathbb{E}[X]).$ (28)
Proof sketch.

Apply the subgradient property with $x = \mathbb{E}[X]$ and $y = X$, then take expectations; the inequality pops out directly. ∎

4 Moments and Tails

How do we collect information about the distribution of a random variable?

4.1 Definitions

Definition 4.1 (Moment).

Let $X$ be a real random variable with law $\mu_X$. A moment is an integral of a real test function ($\mu_X$-integrable) $h : \mathbb{R} \to \mathbb{R}$ against the law:

$\mathbb{E}[h(X)] = \int h(x)\, \mu_X(dx).$ (29)

Examples:

  • Indicator: $h = \mathbb{1}_B$ for Borel $B$, which gives $\mu_X(B)$.

  • First-order moment: $h(x) = x$. Then $m_1 = \mathbb{E}[X] = \int x\, \mu_X(dx)$.

  • $n$-th polynomial moment: $h(x) = x^n$ for $n \in \mathbb{N}$; $m_n = \mathbb{E}[X^n]$.

  • Exponential moment: $h(x) = \exp(\theta x)$, giving $\mathbb{E}[e^{\theta X}]$.

Definition 4.2 (Tails).

Let $X$ be a real random variable and $t \in \mathbb{R}$. The (right) tail is defined as $\mathbb{P}(X \geq t)$. We often refer to the tail probability $\mathbb{P}(|X| \geq t)$.

4.2 Moments and Tail Bounds

Several theorems on tail bounds build a bridge between moments and tails.

Theorem 4.3 (Markov’s inequality).

Let $X$ be a real random variable with $X \geq 0$. Then

$\mathbb{P}[X \geq t] \leq \frac{1}{t}\, \mathbb{E}[X], \quad \forall t > 0.$ (30)

Remark. For positive, increasing $\varphi : \mathbb{R}_+ \to \mathbb{R}_+$, we have

$\mathbb{P}\{X \geq t\} \leq \frac{1}{\varphi(t)}\, \mathbb{E}[\varphi(X)], \quad \forall t > 0.$ (31)

We can use $\varphi(x) = x^p$ to bring higher-order information about $X$ into the tail bound; $\varphi(x) = \exp(cx)$ sometimes yields a tighter tail bound with exponential decay, i.e., the Chernoff bound.

In other words, polynomial moments control tail probabilities.
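To see how higher moments tighten Markov's bound, here is a small sketch (ours, not from the notes) for $X \sim \mathrm{Exp}(1)$, whose true tail is $e^{-t}$: the first-moment bound is $1/t$, the $p$-th moment bound is $\mathbb{E}X^p / t^p = p!/t^p$, and the exponential (Chernoff) bound $\inf_{0<\theta<1} e^{-\theta t}/(1-\theta) = t e^{1-t}$ recovers the right exponential decay.

```python
import numpy as np
from math import factorial

t = 10.0
true_tail = np.exp(-t)                     # P{X >= t} = e^{-t} for X ~ Exp(1)

markov = 1.0 / t                           # E[X] / t, with E[X] = 1
moment = {p: factorial(p) / t**p for p in (2, 4, 8)}   # E[X^p]/t^p = p!/t^p
chernoff = t * np.exp(1.0 - t)             # inf_{0<theta<1} e^{-theta t}/(1-theta)

print(true_tail, markov, moment, chernoff)
# Markov gives 0.1; higher moments tighten it; Chernoff decays exponentially.
```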

Theorem 4.4 (Integration by parts).

Let $X$ be a positive real random variable and $\varphi : \mathbb{R}_+ \to \mathbb{R}$ an increasing, continuously differentiable function. Then

$\mathbb{E}[\varphi(X)] = \varphi(0) + \int_0^{\infty} \mathbb{P}\{X \geq t\}\, \varphi'(t)\, dt.$ (32)

Remark.

  • When $\varphi(x) = x^q$ for some $q > 0$, then

    $\mathbb{E}[|X|^q] = \int_0^{\infty} \mathbb{P}\{|X| \geq t\}\, q t^{q-1}\, dt.$ (33)

This means tail probability controls moments.

Assume $X$ is a real random variable with $\mathbb{P}\{|X| \geq t\} = O(t^{-p})$ for some $p \geq 1$, say $\mathbb{P}\{|X| \geq t\} \leq C t^{-p}$. Then we can bound the $q$-th moment for $q < p$:

$\mathbb{E}|X|^q = \int_0^{\infty} \mathbb{P}\{|X| \geq t\}\, q t^{q-1}\, dt$ (34)
$= \int_0^1 \mathbb{P}\{|X| \geq t\}\, q t^{q-1}\, dt + \int_1^{\infty} \mathbb{P}\{|X| \geq t\}\, q t^{q-1}\, dt$ (35)
$\leq \int_0^1 q t^{q-1}\, dt + \int_1^{\infty} C q t^{q-p-1}\, dt$ (36)
$= 1 + \frac{Cq}{p - q} < +\infty.$ (37)

5 L p Spaces

Definition 5.1 ( L p space).

For $p > 0$, the space $L_p := L_p(\Omega, \mathcal{F}, \mathbb{P})$ is

$L_p := \{X : \Omega \to \mathbb{R} : \mathbb{E}|X|^p < \infty\}.$ (38)

Roughly, these are random variables whose tails decay at rate at least $t^{-p}$. They are called "$p$-integrable random variables".

Remarks. L p is a linear space.

Definition 5.2 (Homogeneous p-th moment.).

For $p > 0$, we define the homogeneous $p$-th moment of $X$ by

$\|X\|_p := (\mathbb{E}|X|^p)^{1/p},$ (39)

sometimes denoted $\|X\|_{L_p}$.

Theorem 5.3 (Lyapunov inequality).

For $0 < p \leq q$, every real random variable $X$ satisfies

$\|X\|_p \leq \|X\|_q.$ (40)

Therefore,

$X \in L_q \implies X \in L_p.$ (41)
Proof sketch.

Use Jensen's inequality with $\varphi(t) = |t|^{q/p}$. ∎

Warning. For sequence spaces the inclusion goes the other way: $\ell_p \subseteq \ell_q$ for $p \leq q$. For Lebesgue spaces on $\mathbb{R}$, neither $L_p(\mathbb{R})$ nor $L_q(\mathbb{R})$ contains the other.
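A one-line numerical check of Lyapunov's inequality (our sketch, not from the notes): for any sample, the empirical $\|X\|_p$ should be nondecreasing in $p$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.exponential(size=100_000)

norms = {p: np.mean(np.abs(X) ** p) ** (1.0 / p) for p in (0.5, 1, 2, 4)}
print(norms)   # nondecreasing in p, as Lyapunov predicts
```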

5.1 Pseudonorm

Question: Is $\|\cdot\|_{L_p}$ a norm? We would need to prove the triangle inequality.

Theorem 5.4 (Hölder's inequality).

Let $X, Y$ be real random variables with $X \in L_p$, $Y \in L_q$, where $\frac{1}{p} + \frac{1}{q} = 1$ and $p, q > 1$. Then

$\|XY\|_1 \leq \|X\|_p\, \|Y\|_q.$ (42)
Proof.

We use a lemma known as Young's inequality: if $\frac{1}{p} + \frac{1}{q} = 1$ with $p > 1$, then $|xy| \leq \frac{1}{p}|x|^p + \frac{1}{q}|y|^q$. Apply it to $X/\|X\|_p$ and $Y/\|Y\|_q$ and take expectations. ∎

Theorem 5.5 (Minkowski).

Assume $p \geq 1$ and let $X, Y \in L_p$. Then

$\|X + Y\|_p \leq \|X\|_p + \|Y\|_p.$ (43)
Proof.
$|X + Y|^p = |X + Y| \cdot |X + Y|^{p-1} \leq |X| \cdot |X + Y|^{p-1} + |Y| \cdot |X + Y|^{p-1}.$ (44)

Apply Hölder with $\frac{1}{p} + \frac{p-1}{p} = 1$:

$\mathbb{E}|X+Y|^p \leq (\mathbb{E}|X|^p)^{1/p} (\mathbb{E}|X+Y|^p)^{\frac{p-1}{p}} + (\mathbb{E}|Y|^p)^{1/p} (\mathbb{E}|X+Y|^p)^{\frac{p-1}{p}}.$ (45)

Dividing both sides by $(\mathbb{E}|X+Y|^p)^{\frac{p-1}{p}}$, the triangle inequality pops out. ∎

Proposition 5.6 ( L p pseudonorm).

For random variables in $L_p$, $\|\cdot\|_p$ is a pseudonorm:

  • Positive semi-definite: $\|X\|_p \geq 0$ and $\|0\|_p = 0$.

  • Positively homogeneous: $\|\alpha X\|_p = |\alpha|\, \|X\|_p$.

  • Triangle inequality: $\|X + Y\|_p \leq \|X\|_p + \|Y\|_p$.

  • Almost positive: $\|X\|_p = 0$ implies $X = 0$ $\mathbb{P}$-almost surely.

We can thus define a pseudometric on $L_p$:

$D(X, Y) := \|X - Y\|_p, \quad X, Y \in L_p.$ (46)

5.2 Convergence in L p

Definition 5.7.

A sequence $(X_j : j \in \mathbb{N})$ in $L_p$ converges in $L_p$ if there is a real random variable $Y \in L_p$ such that

$\|X_j - Y\|_p \to 0, \quad \text{as } j \to \infty.$ (47)

Notes. $X_j \to Y$ in $L_p$ implies $X_j \to Y$ in $L_q$ for $q \leq p$.

Question: When does a sequence in L p converge?

Definition 5.8 (Cauchy sequence).

A sequence $(X_j : j \in \mathbb{N})$ in $L_p$ is called Cauchy if

$\sup_{i, j \geq N} \|X_i - X_j\|_p \to 0 \quad \text{as } N \to \infty.$ (48)

It is trivial that any convergent sequence is Cauchy, but the converse is not.

Theorem 5.9 ( L p space is complete.).

Every Cauchy sequence in L p converges to a random variable. Moreover, it converges to a random variable in L p .

Notes. An example of a non-complete space: $\mathbb{Q}$. A sequence of rational numbers may converge to an irrational number.

Remark. Limits in $L_p$ are not necessarily unique! They are only required to be equal $\mathbb{P}$-almost surely.

5.3 L 2 Space

Next, we restrict our discussion to the $L_2$ space, which is special because it supports orthogonality, variance, covariance, and orthogonal projection.

$L_2 := L_2(\Omega, \mathcal{F}, \mathbb{P}) = \{X : \Omega \to \mathbb{R} : \mathbb{E}|X|^2 < \infty\}.$ (49)
Theorem 5.10 (Cauchy-Schwarz).

If $X, Y \in L_2$, then $XY \in L_1$, and

$|\mathbb{E}[XY]| \leq \|XY\|_1 \leq \|X\|_2\, \|Y\|_2.$ (50)

This is a special case of Hölder's theorem with $p = q = 2$. Nevertheless, here is another, simpler proof.

Proof.

Let $t \in \mathbb{R}$. Then

$0 \leq \mathbb{E}(t|X| + |Y|)^2 = \mathbb{E}|X|^2\, t^2 + 2t\, \mathbb{E}|XY| + \mathbb{E}|Y|^2,$ (51)

which is a quadratic function of $t$ with at most one real root. Hence the discriminant satisfies $4(\mathbb{E}|XY|)^2 - 4\, \mathbb{E}|X|^2\, \mathbb{E}|Y|^2 \leq 0$. ∎

Definition 5.11 ( L 2 (pseudo)-inner product).

For random variables $X, Y \in L_2$, we define

$\langle X, Y \rangle_{L_2} := \mathbb{E}[XY].$ (52)

Remarks.

  • This is well-defined by Cauchy–Schwarz.

  • $\langle X, X \rangle = \mathbb{E}[|X|^2] = \|X\|_{L_2}^2$.

Definition 5.12 (Orthogonality).

If $\langle X, Y \rangle = 0$, then we say $X$ and $Y$ are orthogonal random variables, and we write $X \perp Y$.

Warning: $X \perp Y$ does NOT imply that $X, Y$ are "independent".

Definition 5.13 (Covariance).

Let $X, Y \in L_2$. The covariance of $X, Y$ is

$\mathrm{Cov}(X, Y) := \langle X - \mathbb{E}X,\ Y - \mathbb{E}Y \rangle.$ (53)

If $\mathrm{Cov}(X, Y) = 0$, we say $X, Y$ are uncorrelated, which is not necessarily independent!

Moreover, we define the variance of a random variable $X \in L_2$ by $\mathrm{Var}(X) := \mathrm{Cov}(X, X)$.

The correlation of $X, Y$ is defined by

$\rho(X, Y) := \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\, \mathrm{Var}(Y)}} \in [-1, 1].$ (54)
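The following sketch (ours, not from the notes) estimates $\mathrm{Cov}$ and $\rho$ by Monte Carlo, and illustrates the warning above: with $Z \sim \mathcal{N}(0, 1)$, the pair $X = Z$, $Y = Z^2$ is uncorrelated (since $\mathbb{E}Z^3 = 0$) yet clearly dependent.

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.standard_normal(1_000_000)
X, Y = Z, Z**2                               # Cov(Z, Z^2) = E[Z^3] = 0

cov = np.mean((X - X.mean()) * (Y - Y.mean()))
rho = cov / np.sqrt(X.var() * Y.var())
print(cov, rho)                              # both approximately 0

# ...yet X and Y are clearly dependent: conditioning on |X| changes Y.
print(np.mean(Y[np.abs(X) > 2]), np.mean(Y)) # ~5.4 vs ~1.0
```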
Proposition 5.14 (Pythagorean).

If $X, Y \in L_2$, then $\mathrm{Cov}(X, Y) = 0$ implies

$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y).$ (55)
Theorem 5.15 (Orthogonal projection).

Let $K \subseteq L_2$ be a complete linear subspace of $L_2(\Omega, \mathcal{F}, \mathbb{P})$.

For $X \in L_2$, there is a random variable $Y \in K$ such that

  • Primal: $\|X - Y\|_2 = \inf\{\|X - w\|_2 : w \in K\}$.

  • Dual: $(X - Y) \perp Z$ for all $Z \in K$.

These two statements are equivalent. We call $Y$ a version of the orthogonal projection of $X$ onto $K$.

Remark. Such $Y$ is not unique, but any two versions $Y, Y'$ satisfy $Y = Y'$ almost surely.

Proof.

Consider a minimizing sequence $(Y_i \in K : i \in \mathbb{N})$ such that $\|Y_i - X\|_2$ is decreasing and converges to $d := \inf\{\|w - X\|_2 : w \in K\}$.

Step 1: Show that $(Y_i)$ has a limit in $K$, which is a candidate for the projection. Check that the sequence $(Y_i)$ is Cauchy: by the parallelogram law,

$0 \leq \left\|\tfrac{1}{2}(Y_i - Y_j)\right\|_2^2 = \tfrac{1}{2}\|Y_i - X\|_2^2 + \tfrac{1}{2}\|Y_j - X\|_2^2 - \left\|\tfrac{1}{2}(Y_i + Y_j) - X\right\|_2^2 \to d^2 + (-d^2) = 0,$ (56)

since the midpoint $\frac{1}{2}(Y_i + Y_j)$ lies in $K$ and hence has distance at least $d$ from $X$. So $(Y_i)$ is Cauchy in $K$. Since $K$ is complete, $Y_i \to Y$ for some $Y \in K$; this $Y$ is a candidate for the orthogonal projection of $X$ onto $K$.

Step 2: Check that $Y$ is indeed an orthogonal projection. By Minkowski, for each $i$,

$\|X - Y\|_2 \leq \|X - Y_i\|_2 + \|Y_i - Y\|_2,$ (57)

and the right-hand side converges to $d$. Moreover, $\|X - Y\|_2 \geq d$, so $\|X - Y\|_2 = d$.

Step 3: Dual characterization. For $Z \in K$ and $t \in \mathbb{R}$, we have $Y + tZ \in K$ since $K$ is a linear subspace. Then

$\|X - (Y + tZ)\|_2^2 \geq \|X - Y\|_2^2.$ (58)

Expanding in $t$ shows that we must have $\langle X - Y, Z \rangle = 0$. ∎

6 Independence

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. How do we update our knowledge about an event $A$ when another event $E$ has occurred?

$\mathbb{P}(A \mid E) := \frac{\mathbb{P}(A \cap E)}{\mathbb{P}(E)}.$ (59)

We define $\mathcal{F}|_E$ as the σ-algebra $\{A \cap E \mid A \in \mathcal{F}\}$.

Then $(E, \mathcal{F}|_E, \mathbb{P}(\cdot \mid E))$ is a probability space.

Definition 6.1 (Independence of events).

Suppose knowledge that $E$ occurs does not change the probability of $A$:

$\mathbb{P}(A \mid E) = \mathbb{P}(A).$ (60)

Then we say the events $A$ and $E$ are independent. This is equivalent to

$\mathbb{P}(A \cap E) = \mathbb{P}(A)\, \mathbb{P}(E).$ (61)

6.1 Independence for random variables

How about independence of random variables? A random variable can take any value in its range, so we must require a whole family of events to be independent.

Definition 6.2 (Independence of random variables).

Let $X, Y$ be real random variables. We say that $(X, Y)$ are independent when

$\mathbb{P}\{X \in A,\ Y \in B\} = \mathbb{P}\{X \in A\}\, \mathbb{P}\{Y \in B\},$ (62)

for all $A, B \in \mathcal{B}(\mathbb{R})$.

In particular, for all $a, b \in \mathbb{R}$, it is common to see the definition below,

$\mathbb{P}\{X \leq a,\ Y \leq b\} = \mathbb{P}\{X \leq a\}\, \mathbb{P}\{Y \leq b\}.$ (63)

Fact: These two definitions are equivalent.

This is also equivalent to

$\mu_{XY}(A \times B) = \mu_X(A)\, \mu_Y(B), \quad \forall A, B \in \mathcal{B}(\mathbb{R}).$ (64)

Moreover, it is equivalent to $\mu_{XY} = \mu_X \times \mu_Y$ on $\mathcal{B}(\mathbb{R}^2)$.

Proposition 6.3.

Let $X, Y$ be independent real random variables. Let $f \in L_1(\mu_X)$ and $g \in L_1(\mu_Y)$. Then

$\mathbb{E}[f(X) g(Y)] = \mathbb{E}[f(X)]\, \mathbb{E}[g(Y)].$ (65)
Proof.
$\mathbb{E}[f(X) g(Y)] = \int_{\mathbb{R}^2} f(x) g(y)\, \mu_{XY}(dx\, dy)$ (66)
$= \int_{\mathbb{R}^2} f(x) g(y)\, \mu_X(dx)\, \mu_Y(dy)$ (67)
$= \int f(x)\, \mu_X(dx) \int g(y)\, \mu_Y(dy) = \mathbb{E}[f(X)]\, \mathbb{E}[g(Y)]. \qquad \blacksquare$ (68)
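A quick Monte Carlo check of Proposition 6.3 (our sketch, not from the notes): for independent $X, Y$ the product moment factorizes, whereas for a dependent pair (here $Y$ against itself) it does not.

```python
import numpy as np

rng = np.random.default_rng(10)
X = rng.standard_normal(1_000_000)
Y = rng.exponential(size=1_000_000)            # independent of X

# Independent pair: E[f(X) g(Y)] = E[f(X)] E[g(Y)]  (both ~1 here).
print(np.mean(X**2 * Y), np.mean(X**2) * np.mean(Y))
# Dependent pair (Y against itself): E[Y^2 * Y] = 6, but E[Y^2] E[Y] = 2.
print(np.mean(Y**2 * Y), np.mean(Y**2) * np.mean(Y))
```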

Question: Can we generate a collection of random variables with specified marginals? For a countable sequence, the answer is yes!

Theorem 6.4 (Kolmogorov extension).

Let $(\mu_i : i \in \mathbb{N})$ be a collection of Borel probability measures on $\mathbb{R}$. Define

$\Omega = \mathbb{R}^{\mathbb{N}} = \{\omega = (\omega_1, \omega_2, \dots) : \omega_i \in \mathbb{R},\ i \in \mathbb{N}\}.$ (69)

Consider the coordinate random variables $X_i(\omega) = \omega_i$.

We can define the product σ-algebra:

$\mathcal{F} = \sigma(X_i : i \in \mathbb{N}) = \text{the smallest σ-algebra making the coordinate random variables measurable}.$ (70)

There exists a product probability measure $\mathbb{P}$ on $(\Omega, \mathcal{F})$ such that

$\mathbb{P}\{(x_1, x_2, \dots) \in B_1 \times B_2 \times \cdots\} = \prod_i \mathbb{P}(X_i \in B_i) = \prod_i \mu_i(B_i).$ (71)

6.2 Independence for σ-algebras

Note that σ -algebras carry information. We can also define the independence of σ -algebras.

Definition 6.5 (Independence of σ -algebra).

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $\mathcal{F}_1, \mathcal{F}_2$ be sub-σ-algebras of $\mathcal{F}$.

We say that $\mathcal{F}_1, \mathcal{F}_2$ are independent if for all $E_1 \in \mathcal{F}_1$ and $E_2 \in \mathcal{F}_2$,

$\mathbb{P}(E_1 \cap E_2) = \mathbb{P}(E_1)\, \mathbb{P}(E_2).$ (72)

For events $A, B$: $\sigma(\{A\}) = \{\emptyset, A, A^C, \Omega\}$. One can check that $A$ and $B$ are independent if and only if $\sigma(\{A\})$ and $\sigma(\{B\})$ are independent σ-algebras.

For real random variables, define $\sigma(X) := \sigma(\{X^{-1}(B) : B \in \mathcal{B}(\mathbb{R})\})$. Then real random variables $X, Y$ are independent if and only if $\sigma(X), \sigma(Y)$ are independent σ-algebras.

7 Law of Large Numbers

Definition 7.1 (Stochastic process).

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. A stochastic (or random) process is a family $(X_t : t \in T)$ of real random variables, indexed by an abstract set $T$.

  • Discrete time: $T = \mathbb{N}$.

  • Continuous time: $T = \mathbb{R}_{\geq 0}$.

  • Space: $T = \mathbb{R}^n$, "random fields".

Key models covered in these notes: independent sums; discrete-time martingales.

Definition 7.2 (Independent sum).

Let $(Y_i : i \in \mathbb{N})$ be an independent sequence of real random variables.

The partial sums

$X_n := \sum_{i=1}^{n} Y_i$ (73)

compose a discrete-time ($T = \mathbb{Z}_+$) random process, called an independent sum process.

compose a discrete-time ( T = + ) random process, called an independent sum process.

Examples.

  • Repeated independent experiments.

  • Random walk.

  • Renewal processes.

Definition 7.3 (Running average process).

Let $(Y_i : i \in \mathbb{N})$ be independent copies of a real random variable $Y$. Then $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} Y_i$ is called the running average process, and

$\mathbb{E}[\bar{X}_n] = \mathbb{E}[Y],$ (74)
$\mathrm{Var}(\bar{X}_n) = \frac{1}{n}\, \mathrm{Var}[Y].$ (75)

7.1 Weak Law of Large Numbers

Definition 7.4 (Convergence in probability).

A sequence $(W_n : n \in \mathbb{N})$ of real random variables converges in probability to a random variable $W$ if

$\lim_{n \to \infty} \mathbb{P}\{|W_n - W| \geq t\} = 0, \quad \forall t > 0.$ (76)
Theorem 7.5 (Chebyshev’s weak LLN).

Let $Y \in L_2$ be a real random variable and let $(Y_i : i \in \mathbb{N})$ be iid copies of $Y$. Let $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} Y_i$. Then

$\bar{X}_n \to \mathbb{E}Y \quad \text{in probability}.$ (77)

7.2 Strong Law of Large Numbers

Definition 7.6 (A.s. Convergence).

Consider a sequence $(W_n : n \in \mathbb{N})$ of real random variables. We say that $W_n \to W$ almost surely if

$\mathbb{P}\{\omega \in \Omega : W_n(\omega) \to W(\omega)\} = 1.$ (78)

Equivalently,

$\mathbb{P}\{\limsup_{n \to \infty} |W_n - W| > 0\} = 0.$ (79)

Remark.

$W_n \to W$ pointwise $\implies$ $W_n \to W$ a.s. $\implies$ $W_n \to W$ in probability. (80)
Theorem 7.7 (Kolmogorov Strong LLN).

Let $Y \in L_1$ be a real random variable and let $(Y_i : i \in \mathbb{N})$ be iid copies of $Y$. Let $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} Y_i$. Then $\bar{X}_n \to \mathbb{E}Y$ a.s. That is,

$\mathbb{P}\{\bar{X}_n \to \mathbb{E}Y\} = 1.$ (81)
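A simulation sketch of the strong LLN (ours, not from the notes): running averages of iid $\mathrm{Exp}(1)$ samples settle near $\mathbb{E}Y = 1$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.exponential(size=100_000)              # iid with E[Y] = 1

running_avg = np.cumsum(Y) / np.arange(1, Y.size + 1)
for n in (10, 100, 10_000, 100_000):
    print(n, running_avg[n - 1])               # settles near 1 as n grows
```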
Theorem 7.8 (Cantelli Strong LLN).

Assume $Y \in L_4$. Then the previous theorem has a simpler proof :)

Proof.

WLOG, assume 𝔼 Y = 0 .

Lemma 1 (Borel–Cantelli).

Let $(A_n : n \in \mathbb{N})$ be events. Then

$\sum_{n=1}^{\infty} \mathbb{P}(A_n) < \infty \quad \text{implies} \quad \mathbb{P}(\limsup_{n \to \infty} A_n) = 0.$ (82)
Proof of the lemma.
$\limsup_{n \to \infty} A_n = \bigcap_{n=1}^{\infty} \bigcup_{i \geq n} A_i = \text{the event that infinitely many } A_i \text{ happen}$ (83)
$\mathbb{P}(\limsup_{n \to \infty} A_n) \leq \mathbb{P}\Big(\bigcup_{i \geq n} A_i\Big), \quad \forall n$ (84)
$\leq \sum_{i \geq n} \mathbb{P}(A_i) \to 0, \quad \text{as } n \to \infty. \qquad \blacksquare$ (85)

Lemma 2 (Cantelli tail bound).

Assume $\mathbb{E}Y = 0$ and $\mu = \mathbb{E}|Y|^4 < \infty$. Then for all $n$,

$\mathbb{P}\{|\bar{X}_n| \geq n^{-1/8}\} \leq \frac{3\mu}{n^{3/2}}.$ (86)

Continuing the proof: define the event $A_n = \{\omega \in \Omega : |\bar{X}_n(\omega)| \geq n^{-1/8}\}$. Then

$\sum_{n=1}^{\infty} \mathbb{P}(A_n) \leq \sum_{n=1}^{\infty} \frac{3\mu}{n^{3/2}} < \infty.$ (87)

Then, by the Borel–Cantelli lemma,

$\mathbb{P}\{\limsup_{n \to \infty} |\bar{X}_n| > 0\} = 0. \qquad \blacksquare$ (88)

7.3 Concentration

The law of large numbers gives an asymptotic analysis of the limit of $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} Y_i$. However, we also want a nonasymptotic analysis for fixed $n$.

Proposition 7.9 (Chebyshev).

Let $X \in L_2$ and $\sigma^2 = \mathrm{Var}(X)$. Then

$\mathbb{P}\{|X - \mathbb{E}X| \geq \sigma t\} \leq \frac{1}{t^2}, \quad \forall t > 0.$ (89)
  • Chebyshev's inequality gives weak control on the tail ($t^{-2}$), but this is the best we can get assuming only $X \in L_2$.

Definition 7.10 (Moment generating function (mgf) and cumulant generating function (cgf)).

Let $X$ be a real random variable. The moment generating function is defined as

$m_X(\theta) := \mathbb{E}[e^{\theta X}],$ (90)

and the cumulant generating function as

$\xi_X(\theta) := \log \mathbb{E}[e^{\theta X}].$ (91)
Proposition 7.11 (Laplace transform method).

Let $X$ be a real random variable. For all $t$,

$\mathbb{P}\{X \geq t\} \leq \inf_{\theta > 0} \exp(-\theta t + \xi_X(\theta)),$ (92)
$\mathbb{P}\{X \leq t\} \leq \inf_{\theta < 0} \exp(-\theta t + \xi_X(\theta)).$ (93)
Proof sketch.

Apply Markov's inequality to $\exp(\theta X)$. ∎

Example (normal distribution). Recall that for $Z \sim \mathcal{N}(0, \sigma^2)$, we have $\xi_Z(\theta) = \frac{\sigma^2 \theta^2}{2}$. Then

$\mathbb{P}\{Z \geq t\} \leq \inf_{\theta > 0} \exp(-\theta t + \sigma^2 \theta^2 / 2) = \exp(-t^2 / 2\sigma^2).$ (94)
Proposition 7.12 (Additivity of cgf).

Let $X = \sum_i Y_i$ for independent random variables $Y_i$. Then

$\xi_X(\theta) = \sum_i \xi_{Y_i}(\theta).$ (95)

Thanks to the additivity of the cgf, we can easily derive a tail bound for independent sums.

Theorem 7.13.

Let $X = \sum_{i=1}^{n} Y_i$ for independent $Y_i$. Then

$\mathbb{P}\{X \geq t\} \leq \inf_{\theta > 0} \exp\Big(-\theta t + \sum_{i=1}^{n} \xi_{Y_i}(\theta)\Big),$ (96)
$\mathbb{P}\{X \leq t\} \leq \inf_{\theta < 0} \exp\Big(-\theta t + \sum_{i=1}^{n} \xi_{Y_i}(\theta)\Big).$ (97)

Example (binomial). Let $X \sim \mathrm{Binomial}(n, p)$, so $X = \sum_{i=1}^{n} Y_i$ where $Y_i \sim \mathrm{Bern}(p)$ iid. Then

$\xi_{Y_i}(\theta) \leq p(e^{\theta} - 1).$ (98)

So we get (the Chernoff bound), for $t > 1$,

$\mathbb{P}\{X \geq t\, \mathbb{E}[X]\} \leq \inf_{\theta > 0} \exp(np(-\theta t + e^{\theta} - 1)) = \exp(-np(t \log t - t + 1)) = \left(\frac{e^{t-1}}{t^t}\right)^{np}.$ (99)
Theorem 7.14 (Hoeffding’s theorem).

Let $X = \sum_{i=1}^{n} Y_i$ for independent bounded random variables $Y_i \in [a_i, b_i]$. Let $\sigma^2 = \frac{1}{4} \sum_{i=1}^{n} (b_i - a_i)^2$. Then

$\mathbb{P}\{|X - \mathbb{E}X| \geq \sigma t\} \leq 2 \exp(-t^2 / 2).$ (100)
Proof.

We need a cgf bound for each $Y_i$: after centering, $\xi_{Y_i - \mathbb{E}Y_i}(\theta) \leq (b_i - a_i)^2 \theta^2 / 8$. This is because, by convexity of $x \mapsto e^{\theta x}$, we have $e^{\theta x} \leq f(x)$ on $[a, b]$ for the chord

$f(x) = \frac{(e^{\theta b} - e^{\theta a})(x - a)}{b - a} + e^{\theta a},$ (101)

so that, in the centered case $\mathbb{E}Y_i = \frac{a + b}{2}$,

$m_{Y_i}(\theta) = \mathbb{E}[e^{\theta Y_i}] \leq \mathbb{E}[f(Y_i)] = e^{\theta(a+b)/2} \cosh\Big(\frac{\theta(b - a)}{2}\Big),$ (102)

and $\log \cosh(s) \leq s^2/2$ gives the bound. Plugging $\xi_{Y_i} = \log m_{Y_i}$ into the Laplace transform method proves the theorem. ∎
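To compare Hoeffding's bound with reality, here is a small simulation sketch (ours, not from the notes) for sums of iid Rademacher signs, where $a_i = -1$, $b_i = 1$, so $\sigma = \sqrt{n}$. The bound is loose but captures the Gaussian-type decay.

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials, t = 400, 100_000, 2.0
signs = rng.integers(0, 2, size=(trials, n), dtype=np.int8) * 2 - 1
X = signs.sum(axis=1, dtype=np.int64)       # E[X] = 0

sigma = np.sqrt(n)                          # sigma^2 = sum (b_i - a_i)^2 / 4 = n
empirical = np.mean(np.abs(X) >= sigma * t)
bound = 2.0 * np.exp(-t**2 / 2.0)
print(empirical, bound)                     # ~0.046 vs ~0.271
```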

7.4 Weak convergence

Limit theorems in probability tell us when a sequence of random variables converges to a limiting random variable. In this section, we consider describing when the distributions of the random variables converge to a limiting distribution.

What does it mean for the distributions of random variables to converge?

Idea: Moments carry information about distributions.

$\mu(h) = \int h(x)\, \mu(dx)$ (103)

If two measures $\mu, \nu$ are close, then many moments are similar: $\mu(h) \approx \nu(h)$ for "many" $h$.

Definition 7.15.

A function $h : \mathbb{R} \to \mathbb{R}$ is bounded Lipschitz if

$\|h\|_{BL} = \max\{\|h\|_{\sup}, \|h\|_{\mathrm{Lip}}\} < \infty,$ (104)

where

$\|h\|_{\sup} = \sup\{|h(x)| : x \in \mathbb{R}\}$ (105)

and

$\|h\|_{\mathrm{Lip}} = \inf\{L > 0 : |h(x) - h(y)| \leq L|x - y|,\ \forall x, y \in \mathbb{R}\}.$ (106)

Remark. $\|\cdot\|_{BL}$ is a norm.

Proposition 7.16 (BL functions separate measures.).

Let $\mu, \nu$ be Borel probability measures on $\mathbb{R}$.

Then $\mu = \nu$ if and only if $\mu(h) = \nu(h)$ for all bounded Lipschitz $h : \mathbb{R} \to \mathbb{R}$.

Proof.

The forward direction is trivial.

For the reverse direction, assume $\mu \neq \nu$; then the CDFs differ, $F_\mu \neq F_\nu$, so there exists $a \in \mathbb{R}$ with

$F_\mu(a) \neq F_\nu(a).$ (107)

Let us consider a sequence of BL functions:

$h_n(x) = \begin{cases} 1, & x \leq a \\ 0, & x \geq a + \frac{1}{n} \\ 1 - n(x - a), & x \in (a, a + \frac{1}{n}) \end{cases}$ (108)

Using the bounded convergence theorem (or dominated convergence), $\mu(h_n) \to F_\mu(a)$ and $\nu(h_n) \to F_\nu(a)$. Therefore there exists $n$ such that $\mu(h_n) \neq \nu(h_n)$. ∎

Definition 7.17 (Bounded Lipschitz metric).

Let $\mu, \nu$ be Borel probability measures on $\mathbb{R}$. Define

$d_{BL}(\mu, \nu) := \sup\{|\mu(h) - \nu(h)| : \|h\|_{BL} \leq 1\}.$ (109)

Note: $d_{BL}$ is a metric on the probability measures on $\mathbb{R}$, i.e.,

  • $d_{BL}(\mu, \nu) = 0$ if and only if $\mu = \nu$,

  • $d_{BL}(\mu, \nu) = d_{BL}(\nu, \mu)$,

  • $d_{BL}(\mu, \nu) \leq d_{BL}(\mu, \rho) + d_{BL}(\rho, \nu)$.

Definition 7.18 (Weak convergence).

Let $(\mu_n : n \in \mathbb{N})$ be Borel probability measures on $\mathbb{R}$ and $\mu$ a Borel probability measure. We say $\mu_n$ converges weakly to $\mu$ if

$d_{BL}(\mu_n, \mu) \to 0 \quad \text{as } n \to \infty.$ (110)

Let $(X_n : n \in \mathbb{N})$ be real random variables and $X$ a real random variable. Then we say $X_n$ converges weakly to $X$ if $\mu_{X_n}$ converges weakly to $\mu_X$.

Theorem 7.19.

Let $(\mu_n : n \in \mathbb{N})$, $\mu$ be Borel probability measures on $\mathbb{R}$. The following are equivalent:

  • $\mu_n$ converges weakly to $\mu$;

  • $\mu_n(h) \to \mu(h)$ for all bounded Lipschitz $h : \mathbb{R} \to \mathbb{R}$;

  • $\mu_n(h) \to \mu(h)$ for all bounded continuous $h : \mathbb{R} \to \mathbb{R}$;

  • the distribution function $F_{\mu_n}(a)$ converges to $F_\mu(a)$ at every $a$ where $F_\mu$ is continuous;

  • the characteristic functions $\chi_{\mu_n} \to \chi_\mu$ pointwise, where

    $\chi_W(\theta) = \mathbb{E}[e^{i\theta W}].$ (111)
Proposition 7.20.

$X_n \to \mathbb{E}Y$ almost surely implies that $X_n$ converges weakly to the constant $\mathbb{E}Y$.

Definition 7.21 (Integral probability metric).

Let $H$ be a collection of functions from $\mathbb{R}$ to $\mathbb{R}$. Define the IPM associated with $H$ by

$d_H(\mu, \nu) = \sup\{|\mu(h) - \nu(h)| : h \in H\},$ (112)

for all Borel probability measures $\mu, \nu$ on $\mathbb{R}$.

Remark.

  • Different $H$ give different notions of distance.

  • If $H \subseteq H'$, then $d_H \leq d_{H'}$.

  • $d_H$ is a metric if and only if $H$ separates measures.

Examples:

  • Kolmogorov: $H = \{\mathbb{1}_{(-\infty, a]} : a \in \mathbb{R}\}$, which induces uniform convergence of CDFs.

  • Total variation: $H = \{h : \mathbb{R} \to \mathbb{R} : \|h\|_{\sup} \leq 1,\ h \text{ continuous}\}$ or $H = \{\mathbb{1}_B : B \text{ Borel}\}$.

  • Kantorovich (Wasserstein-1) distance:

    $H = \{h : \mathbb{R} \to \mathbb{R} : \|h\|_{\mathrm{Lip}} \leq 1\}.$ (113)

    This metrizes weak convergence together with convergence of first moments ($L_1$).

7.5 Central Limit Theorem

Let $(Y_i : i \in \mathbb{N})$ be iid copies of $Y \in L_2$; then $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} Y_i$ converges almost surely to $\mathbb{E}Y$.

Idea: rescale $\bar{X}_n$ to make its variance nontrivial. We define the standardized sums

$T_n := \sqrt{n}\, \frac{\bar{X}_n - \mathbb{E}Y}{\sqrt{\mathrm{Var}(Y)}}.$ (114)

One can check that $\mathbb{E}T_n = 0$ and $\mathrm{Var}(T_n) = 1$.

Theorem 7.22 (Central Limit Theorem).

Let $(Y_i : i \in \mathbb{N})$ be iid copies of $Y \in L_2$. Define

$T_n = \sqrt{n}\, \frac{\bar{X}_n - \mathbb{E}Y}{\sqrt{\mathrm{Var}(Y)}}, \quad n \in \mathbb{N}.$ (115)

Then $T_n$ converges weakly to $\mathcal{N}(0, 1)$ as $n \to \infty$.

Remark.

  • Historically, the central limit theorem is used to support asymptotic confidence intervals for $\mathbb{E}Y$.

  • Equivalent statements:

    $d_{BL}(T_n, Z) = \sup_{\|h\|_{BL} \leq 1} |\mathbb{E}h(T_n) - \mathbb{E}h(Z)| \to 0.$ (116)

    Also equivalent to

    $\mathbb{P}\{T_n \leq a\} = F_{T_n}(a) \to \Phi(a), \quad \forall a \in \mathbb{R},$ (117)

    which is often called convergence in distribution.
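A simulation sketch of the CLT (ours, not from the notes): standardized sums of iid $\mathrm{Exp}(1)$ variables have CDFs approaching $\Phi$, and the Kolmogorov distance shrinks roughly like $n^{-1/2}$, previewing the Berry–Esseen rate below.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
trials = 50_000
a = np.linspace(-4, 4, 801)
for n in (4, 16, 64, 256):
    Y = rng.exponential(size=(trials, n))         # E[Y] = Var(Y) = 1
    T = np.sqrt(n) * (Y.mean(axis=1) - 1.0)       # standardized sums
    F_emp = np.searchsorted(np.sort(T), a) / trials
    print(n, np.max(np.abs(F_emp - norm.cdf(a)))) # Kolmogorov distance shrinks
```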

7.6 Nonasymptotic counterpart

Theorem 7.23 (Berry-Esseen).

Let $(Y_i : i \in \mathbb{N})$ be iid copies of $Y \in L_3$. Define

$\sigma^2 = \mathrm{Var}(Y), \quad \text{and} \quad M_3 = \mathbb{E}|Y - \mathbb{E}Y|^3.$ (118)

Let $T_n$ be the standardized sum. Then the Kolmogorov distance satisfies

$d_{\mathrm{Kol}}(T_n, Z) := \sup_{a \in \mathbb{R}} |F_{T_n}(a) - \Phi(a)| \leq \frac{M_3}{\sqrt{n}\, \sigma^3}, \quad \forall n \in \mathbb{N}.$ (119)
Proof of a weaker version by the Lindeberg exchange principle.

The idea is to show that for smooth functions $h : \mathbb{R} \to \mathbb{R}$,

$\mathbb{E}h(T_n) \approx \mathbb{E}h(Z), \quad \text{by Taylor expansion}.$ (120)
Lemma 3.

Let $Y, Z$ be $L_3$ random variables with $\mathbb{E}Y = \mathbb{E}Z$ and $\mathbb{E}Y^2 = \mathbb{E}Z^2$. Let $h : \mathbb{R} \to \mathbb{R}$ with $\|h^{(3)}\|_{\sup} < \infty$. Then

$|\mathbb{E}h(Y) - \mathbb{E}h(Z)| \leq \frac{1}{6} \|h^{(3)}\|_{\sup} (\mathbb{E}|Y|^3 + \mathbb{E}|Z|^3).$ (121)
Proof of the lemma.
$h(t) - h(0) - t h'(0) - \frac{t^2}{2} h''(0) = \frac{t^3}{6} h^{(3)}(\xi), \quad \text{for some } \xi \in [0, t].$ (122)

Then

$\left|\mathbb{E}\Big[h(Y) - h(0) - Y h'(0) - \frac{1}{2} Y^2 h''(0)\Big]\right| \leq \frac{1}{6}\, \mathbb{E}|Y|^3\, \|h^{(3)}\|_{\sup}.$ (123)

The terms other than $h(Y)$ have the same expectations for $Y$ and $Z$ (their first and second moments match), so

$|\mathbb{E}h(Y) - \mathbb{E}h(Z)| \leq \frac{1}{6} \|h^{(3)}\|_{\sup} (\mathbb{E}|Y|^3 + \mathbb{E}|Z|^3). \qquad \blacksquare$ (124)

Let $Y_1, \dots, Y_n$ and $Z_1, \dots, Z_n$ be independent with $\mathbb{E}Y_i = \mathbb{E}Z_i$ and $\mathbb{E}Y_i^2 = \mathbb{E}Z_i^2$. For $f : \mathbb{R}^n \to \mathbb{R}$ such that $\|\partial_i^3 f\|_{\sup} < \infty$ for each $i$,

$|\mathbb{E}f(Y_1, \dots, Y_n) - \mathbb{E}f(Z_1, \dots, Z_n)| \leq \frac{1}{6} \sum_{i=1}^{n} \|\partial_i^3 f\|_{\sup} (\mathbb{E}|Y_i|^3 + \mathbb{E}|Z_i|^3).$ (125)

This is called the Lindeberg exchange principle. To see this, let

$W_i = (Y_1, \dots, Y_i, Z_{i+1}, \dots, Z_n).$ (126)

Then, since $W_0 = (Z_1, \dots, Z_n)$ and $W_n = (Y_1, \dots, Y_n)$,

$|\mathbb{E}f(Y) - \mathbb{E}f(Z)| \leq \sum_{i=1}^{n} |\mathbb{E}f(W_i) - \mathbb{E}f(W_{i-1})| \leq \frac{1}{6} \sum_{i=1}^{n} \|\partial_i^3 f\|_{\sup} (\mathbb{E}|Y_i|^3 + \mathbb{E}|Z_i|^3).$ (127)

Let $Y_i \sim Y$, $Z_i \sim Z$ in $L_3$ be independent random variables that are already standardized, i.e., $\mathbb{E}Y = \mathbb{E}Z = 0$ and $\mathrm{Var}(Y) = \mathrm{Var}(Z) = 1$. Let $M = \max\{\mathbb{E}|Y|^3, \mathbb{E}|Z|^3\} < \infty$. Let $h : \mathbb{R} \to \mathbb{R}$ be smooth. Define

$f(X_1, \dots, X_n) = h\Big(\frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i\Big), \quad \text{where } X \text{ stands for } Y \text{ or } Z.$ (128)

Then $\|\partial_i^3 f\|_{\sup} \leq n^{-3/2}\, \|h^{(3)}\|_{\sup}$, so applying the Lindeberg exchange principle,

$\left|\mathbb{E}h\Big(\frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i\Big) - \mathbb{E}h\Big(\frac{1}{\sqrt{n}} \sum_{i=1}^{n} Z_i\Big)\right| \leq \frac{M}{3\sqrt{n}}\, \|h^{(3)}\|_{\sup}.$ (129)

Let

$H = \{h : \mathbb{R} \to \mathbb{R} : \|h^{(3)}\|_{\sup} \leq 1\}.$ (130)

Then

$d_H\Big(\frac{1}{\sqrt{n}} \sum_i Y_i,\ \frac{1}{\sqrt{n}} \sum_i Z_i\Big) \leq \frac{M}{3\sqrt{n}}.$ (131)

Issues:

  • Where is the normal random variable?

    Solution: choose $Z_i \sim \mathcal{N}(0, 1)$; then $\frac{1}{\sqrt{n}} \sum_{i=1}^{n} Z_i \sim \mathcal{N}(0, 1)$ exactly.

  • What does $d_H$ have to do with $d_{BL}$?

    Solution: for each $h : \mathbb{R} \to \mathbb{R}$ with $\|h\|_{BL} \leq 1$, we can smooth $h$ to get a function $h_\sigma$ such that $\|h_\sigma^{(3)}\|_{\sup}$ is bounded and $\|h - h_\sigma\|_{\sup}$ is small, which implies

    $d_{BL}(T_n, Z) \leq \frac{C\, (\mathbb{E}|Y|^3)^{1/3}}{n^{1/6}} \to 0, \quad \text{as } n \to \infty.$ (132)

8 Conditional Expectation

If two random variables are dependent, by observing one, we can update our knowledge about probabilities of events involving the other. In particular, we can update our “best guess” for the expectation.

Fix a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Let $X \in L_2(\Omega, \mathcal{F}, \mathbb{P})$.

Recall that $\mathrm{Var}(X) = \mathbb{E}[(X - \mathbb{E}X)^2] = \inf_{a \in \mathbb{R}} \|X - a\|_2^2$.

8.1 Conditional expectation in least squares

Consider a subspace of $L_2$:

$K_0 = \{Y \in L_2(\Omega, \mathcal{F}, \mathbb{P}) : Y(\omega) = a\ \forall \omega \in \Omega,\ a \in \mathbb{R}\}.$ (133)

$K_0$ is a complete subspace of $L_2$, so there is always a projection from $L_2$ onto $K_0$, i.e.,

$Y^* = \arg\min_{Y \in K_0} \mathbb{E}(X - Y)^2.$ (134)

Now suppose we observe a random variable $Z \in L_2$ and want to update our best guess of $X$. We first try an affine approximation:

$\min_Y \|X - Y\|_2,$ (135)

where $Y \in K_1$ and

$K_1 = \{a + bZ : a, b \in \mathbb{R}\}.$ (136)

We can compute the optimal $a, b$ given $\mathrm{Var}(X)$, $\mathrm{Var}(Z)$, and $\mathrm{Cov}(X, Z)$.

Since $K_0 \subseteq K_1$, the error is no worse than that of $Y = \mathbb{E}X$.

However, this affine function of $Z$ can be suboptimal (consider $X = Z^2$). In general, we have to consider nonlinear functions of $Z$.

Consider $K_Z = \{h(Z) \in L_2 : h : \mathbb{R} \to \mathbb{R} \text{ measurable}\}$ and

$\min \|X - Y\|_2, \quad \text{s.t. } Y \in K_Z.$ (137)

Recall the following lemma.

Lemma 4.

$Y$ is $\sigma(Z)$-measurable if and only if $Y = h(Z)$ for some measurable $h : \mathbb{R} \to \mathbb{R}$.

As a result, $K_Z$ is a complete subspace of $L_2$, since

$K_Z = L_2(\Omega, \sigma(Z), \mathbb{P}).$ (138)

Therefore, there exists an orthogonal projection of $X$ onto $K_Z$; moreover, it is unique almost surely.

We define this projection as $\mathbb{E}[X \mid Z]$.
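A sketch (ours, not from the notes) of $\mathbb{E}[X \mid Z]$ as a least-squares projection: for $X = Z^2 + \varepsilon$, the best affine predictor in $K_1$ is beaten by a nonlinear estimate of $h(Z) = \mathbb{E}[X \mid Z]$, here approximated by averaging $X$ within narrow bins of $Z$ (the binning is our illustrative device, not a construction from the notes).

```python
import numpy as np

rng = np.random.default_rng(6)
Z = rng.standard_normal(500_000)
X = Z**2 + 0.1 * rng.standard_normal(Z.size)      # E[X | Z] = Z^2

# Best affine predictor a + b*Z in K_1 (here Cov(X, Z) = 0, so it is constant).
b = np.cov(X, Z)[0, 1] / Z.var()
a = X.mean() - b * Z.mean()
mse_affine = np.mean((X - (a + b * Z))**2)

# Nonlinear predictor: average X within narrow bins of Z, a proxy for h(Z).
edges = np.linspace(-4, 4, 201)
idx = np.digitize(Z, edges)
counts = np.bincount(idx, minlength=edges.size + 1)
sums = np.bincount(idx, weights=X, minlength=edges.size + 1)
h_of_Z = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)[idx]
mse_nonlinear = np.mean((X - h_of_Z)**2)

print(mse_affine, mse_nonlinear)   # ~2.01 vs ~0.01: projecting onto K_Z wins
```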

8.2 Conditioning on a σ -algebra

Consider conditioning on a σ-algebra $\mathcal{G} \subseteq \mathcal{F}$. For each event $G \in \mathcal{G}$, we can decide whether $\omega \in G$ or not.

Definition 8.1 (Conditional expectation in L 2 ).

Fix $(\Omega, \mathcal{F}, \mathbb{P})$. Let $\mathcal{G} \subseteq \mathcal{F}$ be a sub-σ-algebra on $\Omega$.

For $X \in L_2(\Omega, \mathcal{F}, \mathbb{P})$, consider

$\min_Y \|X - Y\|_2, \quad \text{s.t. } Y \in L_2(\Omega, \mathcal{G}, \mathbb{P}|_{\mathcal{G}}).$ (139)

A solution $Y$ to this least-squares problem is the conditional expectation of $X$ given $\mathcal{G}$, denoted $Y = \mathbb{E}[X \mid \mathcal{G}]$.

Dual characterization: $(X - Y) \perp Z$ for all $Z \in L_2(\Omega, \mathcal{G}, \mathbb{P}|_{\mathcal{G}})$.

Remarks.

  • $Y = \mathbb{E}[X \mid \mathcal{G}]$ is a random variable on the sample space. Its value is determined once we know $\mathbb{1}_G$ for all $G \in \mathcal{G}$.

  • For a random variable $Z$, $\mathbb{E}[X \mid Z] := \mathbb{E}[X \mid \sigma(Z)]$.

  • For an event $E$, $\mathbb{E}[X \mid E] := \mathbb{E}[X \mid \sigma(E)]$. The conditional expectation must have the form $Y = a \mathbb{1}_E + b \mathbb{1}_{E^C}$.

    The optimal values are $a = \frac{\mathbb{E}[X \mathbb{1}_E]}{\mathbb{P}(E)}$ and $b = \frac{\mathbb{E}[X \mathbb{1}_{E^C}]}{\mathbb{P}(E^C)}$.

8.3 Characteristic properties of conditional expectation

Let $G_1, \dots, G_n$ be a disjoint cover of $\Omega$ and $\mathcal{G} = \sigma(G_1, G_2, \dots, G_n)$. Let $Y = \mathbb{E}[X \mid \mathcal{G}]$.

Measurability: $Y$ is constant on each minimal event $G_i$. Therefore $Y$ is $\mathcal{G}$-measurable, i.e., $Y^{-1}(B) \in \mathcal{G}$ for all Borel $B$.

Consistency: the average of $Y$ on events in $\mathcal{G}$ equals the average of $X$ on the same events:

$\mathbb{E}[X \mathbb{1}_G] = \mathbb{E}[Y \mathbb{1}_G], \quad \forall G \in \mathcal{G}.$ (140)

Proof.
$\langle X - Y, W \rangle = 0, \quad \forall W \in L_2(\Omega, \mathcal{G}, \mathbb{P}).$ (141)

Then $\mathbb{E}[XW] = \mathbb{E}[YW]$. Since $\mathbb{1}_G$ is $\mathcal{G}$-measurable, the statement follows. ∎

8.4 Conditional expectation in L 1

So far we have defined conditional expectation only in $L_2$. How can we make it work when $X$ only belongs to $L_1$?

Definition 8.2 (Kolmogorov, 1933).

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, $\mathcal{G} \subseteq \mathcal{F}$ a sub-σ-algebra on $\Omega$, and $X \in L_1(\Omega, \mathcal{F}, \mathbb{P})$. A real random variable $Y$ is a version of $\mathbb{E}[X \mid \mathcal{G}]$ if

  • $Y$ is integrable: $Y \in L_1(\Omega, \mathcal{F}, \mathbb{P})$;

  • $Y$ is $\mathcal{G}$-measurable;

  • Consistency: $\mathbb{E}[Y \mathbb{1}_G] = \mathbb{E}[X \mathbb{1}_G]$ for all $G \in \mathcal{G}$.

Theorem 8.3 (Fundamental theorem of conditional expectation).

There exists a version $Y$ of the conditional expectation $\mathbb{E}[X \mid \mathcal{G}]$. If $Y'$ is another version, then $\mathbb{P}[Y = Y'] = 1$.

Proof sketch.

Existence. WLOG assume $X \geq 0$. Let $X_n(\omega) = \min\{X(\omega), n\}$.

$X_n$ is positive, bounded, $\mathcal{F}$-measurable, and increases monotonically to $X$ pointwise.

Boundedness implies $X_n \in L_2$, so we can form its conditional expectation $Y_n = \mathbb{E}[X_n \mid \mathcal{G}]$. One can check that $Y_1 \leq Y_2 \leq \cdots \leq Y_n \leq \cdots$ a.s. Define

$Y(\omega) = \limsup_{n \to \infty} Y_n(\omega) \quad \text{for } \omega \in \Omega.$ (142)

By the monotone convergence theorem, $\mathbb{E}[Y] = \lim_n \mathbb{E}[Y_n] = \lim_n \mathbb{E}[X_n] = \mathbb{E}[X] < \infty$. By a similar argument, $\mathbb{E}[Y \mathbb{1}_G] = \mathbb{E}[X \mathbb{1}_G]$ for all $G \in \mathcal{G}$.

Moreover, $Y = \limsup_n Y_n$ is $\mathcal{G}$-measurable, since the $Y_n$ are.

Uniqueness. Assume $Y, Y'$ are both versions of the conditional expectation. By consistency, $\mathbb{E}[(Y - Y')\mathbb{1}_G] = 0$ for all $G \in \mathcal{G}$. For contradiction, assume $\mathbb{P}\{Y > Y'\} > 0$.

Let $E_n = \{Y > Y' + 1/n\}$, which are $\mathcal{G}$-measurable events. $E_n$ increases monotonically to $E = \{Y > Y'\}$, so $\mathbb{P}(E_n)$ increases monotonically to $\mathbb{P}\{Y > Y'\} > 0$. Therefore there exists $n$ such that $\mathbb{E}[(Y - Y')\mathbb{1}_{E_n}] \geq \frac{1}{n} \mathbb{P}(E_n) > 0$, a contradiction. ∎

Proposition 8.4 (Conditional expectation is an expectation.).

Conditional expectation satisfies the same properties as expectation.

  • Unital and positive.

  • Conditional expectation is linear.

  • Monotone: if $X \leq Y$ a.s., then $\mathbb{E}[X \mid \mathcal{G}] \leq \mathbb{E}[Y \mid \mathcal{G}]$ a.s.

  • Jensen's inequality holds: $\mathbb{E}[\varphi(X) \mid \mathcal{G}] \geq \varphi(\mathbb{E}[X \mid \mathcal{G}])$ a.s. for convex $\varphi$.

8.5 Convergence theorems

Theorem 8.5 (Conditional Monotonic convergence).

Suppose $0 \leq X_n \uparrow X$ a.s., where $X \in L_1$. Then

$\mathbb{E}[X_n \mid \mathcal{G}] \uparrow \mathbb{E}[X \mid \mathcal{G}] \quad \text{a.s.}$ (143)
Proposition 8.6.

Some properties of conditional expectation:

  • Expectation: if $Y = \mathbb{E}[X \mid \mathcal{G}]$, then $\mathbb{E}[X] = \mathbb{E}[Y]$.

  • Full knowledge: if $X$ is $\mathcal{G}$-measurable, then $\mathbb{E}[X \mid \mathcal{G}] = X$ a.s.

  • Independence: if $\sigma(X)$ is independent of $\mathcal{G}$, then $\mathbb{E}[X \mid \mathcal{G}] = \mathbb{E}[X]$ a.s.

  • Pull-through: if $Z$ is bounded and $\mathcal{G}$-measurable, then $\mathbb{E}[XZ \mid \mathcal{G}] = \mathbb{E}[X \mid \mathcal{G}]\, Z$ a.s.

  • Tower: if $\mathcal{G} \subseteq \mathcal{H}$ are increasing σ-algebras, then

    $\mathbb{E}[\mathbb{E}[X \mid \mathcal{H}] \mid \mathcal{G}] = \mathbb{E}[X \mid \mathcal{G}], \quad \text{a.s.}$ (144)

Conditional integration:

$\mathbb{E}[h(X) \mid \mathcal{G}](\omega) = \int h(x)\, \mu_{X \mid \mathcal{G}}(dx \mid \omega), \quad \text{a.s.}$ (145)

8.6 Gaussian and conditioning

Definition 8.7 (Multivariate normal random variable).

$\mathbf{X} = (X_1, \dots, X_n)$ is a multivariate normal random vector if it is an affine function of a standard normal vector $\mathbf{Z} = (Z_1, \dots, Z_n)$:

$\mathbf{X} = m + \Sigma \mathbf{Z},$ (146)

where $m \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$.

Remark.

  • $\mathbb{E}[X_i] = m_i$ and $\mathrm{Cov}(X_i, X_j) = C_{ij}$, where $C = \Sigma \Sigma^* \succeq 0$.

  • We write $\mathbf{X} \sim \mathcal{N}(m, C)$, as $m, C$ uniquely determine a multivariate normal random vector.

Definition 8.8 (Multivariate characteristic function).

Let $\mathbf{X}$ be a random vector on $\mathbb{R}^n$. The characteristic function of $\mathbf{X}$ is $\chi_{\mathbf{X}} : \mathbb{R}^n \to \mathbb{C}$,

$\chi_{\mathbf{X}}(\theta) := \mathbb{E}[e^{i\theta^T \mathbf{X}}].$ (147)

One can check that if $\mathbf{Y} = A\mathbf{X} + b$, then

$\chi_{\mathbf{Y}}(\theta) = e^{i\theta^T b}\, \chi_{\mathbf{X}}(A^* \theta).$ (148)

If $\mathbf{X}, \mathbf{Y}$ are independent random vectors on $\mathbb{R}^n$, then

$\chi_{\mathbf{X}+\mathbf{Y}}(\theta) = \chi_{\mathbf{X}}(\theta)\, \chi_{\mathbf{Y}}(\theta).$ (149)

It is easy to check that for $\mathbf{X} = m + \Sigma \mathbf{Z}$,

$\chi_{\mathbf{X}}(\theta) = e^{i\theta^T m}\, e^{-\|\Sigma^* \theta\|_2^2 / 2} = \exp\Big(i\theta^T m - \frac{\theta^T C \theta}{2}\Big).$ (150)
Theorem 8.9.

Suppose $X, Y$ are random vectors taking values in $\mathbb{R}^n$. Then the laws of $X$ and $Y$ are the same if and only if their characteristic functions are the same.

Theorem 8.10 (Linear marginals).

A random vector $X$ on $\mathbb{R}^n$ is normal if and only if $\langle a, X \rangle$ is a real normal random variable for every $a \in \mathbb{R}^n$.

8.6.1 Gaussian conditioning

Let $(X, \mathbf{Y}) \in \mathbb{R}^{n+1}$ be jointly multivariate normal, with $\mathbf{Y} = (Y_1, \dots, Y_n)$. Assume $\mathbb{E}[X] = 0$ and $\mathbb{E}[\mathbf{Y}] = 0$.

We want to find the best approximation of $X$ as a linear function of $\mathbf{Y}$:

$\min_{a \in \mathbb{R}^n} \|X - \langle a, \mathbf{Y} \rangle\|_2.$ (151)

We define $c_{XX} = \mathrm{Var}(X)$, $c_{XY} = \mathrm{Cov}(X, \mathbf{Y}) \in \mathbb{R}^n$, and $C_{YY} = \mathrm{Cov}(\mathbf{Y}, \mathbf{Y}) \in \mathbb{R}^{n \times n}$.

Then

$\mathbb{E}[(X - \langle a, \mathbf{Y} \rangle)^2] = c_{XX} - 2\langle a, c_{XY} \rangle + a^T C_{YY} a.$ (152)

Setting the derivative with respect to $a$ to zero,

$a = C_{YY}^{-1} c_{XY}.$ (153)

The best approximation is thus

$\hat{X} = \langle a, \mathbf{Y} \rangle = c_{XY}^T C_{YY}^{-1} \mathbf{Y}.$ (154)

This means that

$X - \hat{X} \perp \mathrm{span}(Y_1, \dots, Y_n).$ (155)

Then $\mathrm{Cov}(X - \hat{X}, \mathbf{Y}) = 0$. For normal distributions specifically, uncorrelatedness implies independence (which can be shown using characteristic functions, as normal random vectors are determined by their mean and covariance).

Theorem 8.11 (Gaussian conditional expectation.).

Unlike for general random variables, for Gaussians the linear function $\langle a, \mathbf{Y} \rangle$ gives full information for conditioning:

$\mathbb{E}[X \mid \mathbf{Y}] = \hat{X} = c_{XY}^T C_{YY}^{-1} \mathbf{Y}.$ (156)
Proof.
$\mathbb{E}[X \mid \mathbf{Y}] = \mathbb{E}[(X - \hat{X}) + \hat{X} \mid \mathbf{Y}]$
$= \mathbb{E}[X - \hat{X} \mid \mathbf{Y}] + \mathbb{E}[\hat{X} \mid \mathbf{Y}]$
$= \mathbb{E}[X - \hat{X}] + \hat{X} \quad \text{(independence and full knowledge)}$
$= \hat{X}. \qquad \blacksquare$
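A simulation sketch of formula (156) (ours, not from the notes) in one dimension: draw a jointly normal pair, compute $\hat{X} = c_{XY} C_{YY}^{-1} Y$ from sample statistics, and check that the residual is uncorrelated with $Y$ and that $\hat{X}$ matches the empirical conditional mean.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000
Y = rng.standard_normal(n)
X = 0.8 * Y + 0.6 * rng.standard_normal(n)   # jointly normal, zero mean

a = np.cov(X, Y)[0, 1] / Y.var()             # c_XY / C_YY, population value 0.8
X_hat = a * Y                                # E[X | Y] by formula (156)

print(a)                                     # ~0.8
print(np.corrcoef(X - X_hat, Y)[0, 1])       # residual uncorrelated with Y
mask = np.abs(Y - 1.0) < 0.05                # empirical conditional mean near Y = 1
print(X[mask].mean(), a * 1.0)               # both ~0.8
```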

9 Martingales

Recall that a σ -algebra captures information about the world. A bigger σ -algebra includes more knowledge.

A filtration is a mathematical model for accumulating information.

9.1 Filtration

Definition 9.1 (Filtration).

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. A (discrete-time) filtration is a sequence of increasing σ-algebras on $\Omega$:

$\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \cdots \subseteq \mathcal{F}.$ (157)

Commonly, one defines $\mathcal{F}_0 = \{\emptyset, \Omega\}$ and $\mathcal{F}_\infty = \sigma(\bigcup_{k=1}^{\infty} \mathcal{F}_k)$.

Examples.

Let $Z_0, Z_1, \dots, Z_n$ be real random variables and $\mathcal{F}_k = \sigma(Z_0, \dots, Z_k)$. Then $(\mathcal{F}_k)$ is a filtration.

Definition 9.2 (Adapted process).

A sequence $(X_0, X_1, \dots)$ of real random variables is adapted to the filtration $(\mathcal{F}_k : k \in \mathbb{Z}_+)$ if each $X_k$ is $\mathcal{F}_k$-measurable.

In other words, at time k , the value of X k is determined by the information we have.

9.2 Defining Martingales

Definition 9.3 (Martingales (Informal)).

A martingale is an adapted process that is indifferent about the future.

That is, for $n \geq k$, $\mathbb{E}[X_n \mid \mathcal{F}_k] = X_k$.

Definition 9.4 (Martingales (Formal)).

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space with a filtration $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \cdots$. A sequence of real random variables $(X_k : k \in \mathbb{Z}_+)$ is a martingale if

  • Adapted: $(X_k)$ is adapted to the filtration $(\mathcal{F}_k)$;

  • Integrable: $\mathbb{E}|X_k| < \infty$ for each $k$;

  • Status quo: $\mathbb{E}[X_{k+1} \mid \mathcal{F}_k] = X_k$ almost surely for each $k$.

We can relax the third property to $\mathbb{E}[X_{k+1} \mid \mathcal{F}_k] \leq X_k$ a.s., in which case $(X_k)$ is called a supermartingale.

If $\mathbb{E}[X_{k+1} \mid \mathcal{F}_k] \geq X_k$ a.s., $(X_k)$ is called a submartingale.

This seemingly awkward terminology comes from super/subharmonic functions in the theory of Markov chains.

Examples. Independent sums of centered random variables, random walks, the Lévy–Doob process.

9.3 Formal model for Gambling strategies

Definition 9.5 (Previsible process.).

A sequence of real random variables $(C_1, C_2, \dots)$ is previsible with respect to the filtration $(\mathcal{F}_k)$ if $C_k$ is $\mathcal{F}_{k-1}$-measurable for all $k$.

When you play a game of chance, you bet before the game.

Definition 9.6 (Martingale transform).

Let $(X_k)$ be a martingale and $(C_k : k \in \mathbb{N})$ a previsible sequence. The martingale transform is the sequence

$(C \cdot X)_k = \sum_{i=1}^{k} C_i (X_i - X_{i-1}), \quad \text{for } k = 1, 2, \dots.$ (158)

Remark. You bet $C_i$ units of capital on a fair game (martingale); you win $C_i (X_i - X_{i-1})$ on the $i$-th day. The total winnings after $k$ plays are

$(C \cdot X)_k = \sum_{i=1}^{k} C_i (X_i - X_{i-1}).$ (159)
Proposition 9.7 (Martingale transform.).

Let $(X_k : k \in \mathbb{Z}_+)$ be a martingale and $(C_k : k \in \mathbb{N})$ a bounded previsible process. Then

$((C \cdot X)_k : k \in \mathbb{Z}_+)$ (160)

is a null martingale, i.e., a martingale with $(C \cdot X)_0 = 0$.

Proof.

By the pull-through property,

$\mathbb{E}[(C \cdot X)_{k+1} - (C \cdot X)_k \mid \mathcal{F}_k] = \mathbb{E}[C_{k+1}(X_{k+1} - X_k) \mid \mathcal{F}_k] = C_{k+1}\, \mathbb{E}[X_{k+1} - X_k \mid \mathcal{F}_k] = 0.$ (161)

Since $(C \cdot X)_k$ is $\mathcal{F}_k$-measurable,

$\mathbb{E}[(C \cdot X)_{k+1} \mid \mathcal{F}_k] = (C \cdot X)_k. \qquad \blacksquare$ (162)

We interpret this proposition as: no sequence of bounded bets allows you to beat a fair game.

Another question: can you gain an advantage in a fair game by quitting when you are ahead?

Definition 9.8 (Stopping times).

A random variable $\tau : \Omega \to \mathbb{Z}_+ \cup \{+\infty\}$ is a stopping time if $\{\tau \leq k\}$ is $\mathcal{F}_k$-measurable for each $k \in \mathbb{Z}_+$.

Remark. The meaning of this definition is that after game $k$, you have enough information to decide whether the stopping criterion has been triggered.

Definition 9.9 (Stopped process).

Let $(X_k)$ be an adapted process and $\tau$ a stopping time. The stopped process is the sequence

$(X_{k \wedge \tau} : k \in \mathbb{Z}_+), \quad \text{where } k \wedge \tau = \min\{k, \tau\}.$ (163)

The idea of this definition is that your winnings freeze after you stop gambling.

Proposition 9.10 (Stopped martingales).

If $(X_k)$ is a martingale and $\tau$ is a stopping time, then the stopped process $(X_{k \wedge \tau})$ is a martingale.

Remark. This means you cannot gain an advantage in a fair game by quitting strategically.

Proof.

Consider the previsible process $(C_k : k \in \mathbb{N})$,

$C_k = \begin{cases} 1, & k \leq \tau, \\ 0, & k > \tau. \end{cases}$ (164)

Since $C_k$ is previsible and bounded, the martingale transform $(C \cdot X)_k$ is a martingale with initial value 0. This means

$(C \cdot X)_k = \sum_{i=1}^{k} C_i (X_i - X_{i-1}) = \sum_{i=1}^{k \wedge \tau} (X_i - X_{i-1}) = X_{k \wedge \tau} - X_0.$ (165)

Therefore,

$\mathbb{E}[X_{k \wedge \tau}] = \mathbb{E}[X_0]. \qquad \blacksquare$ (166)

Warning: It is not true in general that $\mathbb{E}[X_\tau] = \mathbb{E}[X_0]$, as $\tau$ is a random variable depending on the outcomes of $(X_k)$. For example, if we stop when $X_k \geq 1$, then $\mathbb{E}[X_\tau] \geq 1$.

However, with the following theorem, we circumvent this paradox by restricting the horizon of our game plays.

Theorem 9.11 (Optional stopping).

Let $(X_k)$ be a martingale and $\tau$ a stopping time. If $\tau \leq B$ almost surely for a fixed $B \in \mathbb{Z}_+$, then

$\mathbb{E}[X_\tau] = \mathbb{E}[X_0].$ (167)
Proof.

By the proposition on stopped martingales,

$\mathbb{E}[X_0] = \mathbb{E}[X_{k \wedge \tau}] \quad \text{for each } k.$ (168)

Choose $k = B$, so

$\mathbb{E}[X_0] = \mathbb{E}[X_{B \wedge \tau}] = \mathbb{E}[X_\tau]. \qquad \blacksquare$ (169)
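A simulation sketch of optional stopping (ours, not from the notes): for a simple random walk, stop at the first time the walk hits $+3$ or at a fixed horizon $B$, whichever comes first. The capped stopping time is bounded, so $\mathbb{E}[X_\tau] = \mathbb{E}[X_0] = 0$, in contrast to the uncapped rule from the warning above.

```python
import numpy as np

rng = np.random.default_rng(8)
trials, B = 200_000, 50
steps = rng.choice([-1, 1], size=(trials, B))
walk = np.cumsum(steps, axis=1)                 # martingale X_1, ..., X_B

hit = np.argmax(walk >= 3, axis=1)              # first index where X_k >= 3
ever = (walk >= 3).any(axis=1)
tau = np.where(ever, hit, B - 1)                # stopping time capped at B
X_tau = walk[np.arange(trials), tau]

print(X_tau.mean())  # ~0 = E[X_0], as the theorem predicts for bounded tau
# Stopping only when the walk hits +3 (an unbounded stopping time) would give
# E[X_tau] = 3; the boundedness assumption is what rules this out.
```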

9.4 Convergence of martingales

Martingales are favored by probability theorists because they converge to a stable equilibrium under weak assumptions.

Theorem 9.12 (Doob’s convergence theorem).

Let $(X_k : k \in \mathbb{Z}_+)$ be a martingale that is uniformly bounded in $L_1$, i.e., there exists $R > 0$ such that $\mathbb{E}[|X_k|] \leq R$ for all $k \in \mathbb{Z}_+$.

Then, almost surely, $X(\omega) := \lim_{k \to \infty} X_k(\omega)$ exists and is finite. That is, $X_k$ converges to $X$ almost surely.

Remark. It is not true in general that 𝔼 [ X k ] converges to 𝔼 [ X ] .

Lemma 5 (Interval sandwich).

A nonrandom real-valued sequence $(x_k)$ fails to converge in $\overline{\mathbb{R}}$ if and only if

$\liminf_{k \to \infty} x_k < a < b < \limsup_{k \to \infty} x_k,$ (170)

for some $a, b$ (which may be taken rational).

We define “upcrossing” of a sequence to an interval.

Definition 9.13.

Let $(x_k)$ be a nonrandom real sequence. Fix $a < b$. The number $u_N[a, b]$ of upcrossings before time $N$ is the largest $m$ such that there exist indices

$0 \leq s_1 < t_1 < s_2 < t_2 < \cdots < s_m < t_m \leq N,$ (171)

where $x_{s_i} < a$ and $x_{t_i} > b$.

The total number of upcrossings is

$u[a, b] = \lim_{N \to \infty} u_N[a, b].$ (172)
Lemma 6.

Condition (170) implies $u[a, b] = \infty$. Conversely, if $u[a, b]$ is finite for all rational pairs $(a, b)$ with $a < b$, then $(x_k)$ converges in $\overline{\mathbb{R}} = \mathbb{R} \cup \{\pm\infty\}$.

9.4.1 Proof of Doob’s convergence

Idea: if a martingale crosses an interval $[a, b]$ infinitely many times, we could make money by betting on the upcrossings, which is impossible.

Specifically, we start betting when $X_k$ drops below $a$ and keep betting until $X_k$ exceeds $b$. We formalize this as follows.

Fix $a < b$. Let $C_1 = \mathbb{1}\{X_0 < a\}$.

Let $C_k = \mathbb{1}\{C_{k-1} = 1,\ X_{k-1} \leq b\} + \mathbb{1}\{C_{k-1} = 0,\ X_{k-1} < a\}$.

One can check that $(C_k)$ is positive, bounded, and previsible. Then the martingale transform $Y_k = (C \cdot X)_k$ is a null martingale.

Define $U_N[a, b](\omega)$ as the number of upcrossings of $[a, b]$ up to time $N$ by the sample path $X_k(\omega)$, and $U[a, b](\omega) = \lim_{N \to \infty} U_N[a, b](\omega)$.

Note that

$Y_N(\omega) \geq (b - a)\, U_N[a, b](\omega) - (X_N(\omega) - a)_-.$ (173)

Since $\mathbb{E}[Y_N] = 0$, we have the following proposition:

Proposition 9.14 (Snell upcrossing inequality).

Let $(X_k)$ be a martingale and $a < b$ real numbers. Then

$(b - a)\, \mathbb{E}[U_N[a, b]] \leq \mathbb{E}[(X_N - a)_-].$ (174)

Since $\mathbb{E}[|X_N|] \leq R$, it follows that for any $a < b$,

$\mathbb{P}\{U[a, b] = \infty\} = 0.$ (175)

Define the event $E = \{\omega : X_k(\omega) \text{ does not converge in } \overline{\mathbb{R}}\}$. Then

$E = \bigcup_{a < b,\ a, b \in \mathbb{Q}} \{\omega : \liminf x_k < a < b < \limsup x_k\} \subseteq \bigcup_{a < b,\ a, b \in \mathbb{Q}} \{\omega : U[a, b] = \infty\}.$ (176)

Therefore $\mathbb{P}(E) = 0$, because the set of rationals is countable. Thus $X(\omega) = \lim_{k \to \infty} X_k(\omega)$ exists with probability 1. Moreover, by Fatou's lemma,

$\mathbb{E}|X| = \mathbb{E}[\liminf_{k \to \infty} |X_k|] \leq \liminf_{k \to \infty} \mathbb{E}|X_k| \leq R,$ (177)

so the limit is finite almost surely.

9.5 Concentration inequalities for martingales

What is the probability that a martingale ever deviates much from its mean value?

We begin with submartingales: $\mathbb{E}[X_{k+1} \mid \mathcal{F}_k] \geq X_k$ a.s.

Theorem 9.15 (Doob’s maximal inequality for submartingales).

Consider a positive submartingale $(X_k : k \in \mathbb{Z}_+)$. For each $N \in \mathbb{Z}_+$,

$\mathbb{P}\Big\{\sup_{0 \leq k \leq N} X_k > t\Big\} \leq \frac{\mathbb{E}X_N}{t}.$ (178)

Moreover, if $X_k \to X_\infty$ in $L_1$, then the bound also holds for $N = \infty$.

Remark. Although this statement looks like Markov's inequality, it cannot be derived directly from Markov's inequality.

Proof.

We want to bound the probability that the process escapes a band. We define the stopping time $\tau : \Omega \to \mathbb{Z}_+ \cup \{\infty\}$,

$\tau := \inf\{k \leq N : X_k > t\}.$ (179)

Therefore, the event

$E := \Big\{\sup_{k \leq N} X_k > t\Big\} = \{\tau < \infty\}.$ (180)

Then,

$\mathbb{E}X_N \geq \mathbb{E}X_{N \wedge \tau}$
$\geq \mathbb{E}[X_{N \wedge \tau} \mathbb{1}_E]$
$= \mathbb{E}[X_\tau \mathbb{1}_E]$
$> t\, \mathbb{P}[E].$ (181)

Therefore,

$\mathbb{P}\Big(\sup_{k \leq N} X_k > t\Big) \leq \frac{\mathbb{E}[X_N]}{t}.$ (182)

Moreover, $E_N := \{\sup_{k \leq N} X_k > t\}$ increases monotonically to $E_\infty$, and $\mathbb{E}X_N \to \mathbb{E}[X_\infty]$ by $L_1$ convergence. By the dominated convergence theorem, the statement holds for $N = \infty$. ∎

Proposition 9.16 (Martingale: convex transformation.).

Consider a martingale $(M_k : k \in \mathbb{Z}_+)$. For any convex function $\varphi : \mathbb{R} \to \mathbb{R}$, the sequence

$X_k := \varphi(M_k)$ (183)

is a submartingale, provided that $\mathbb{E}|X_k| < \infty$.

Proof.

By conditional Jensen’s inequality. ∎

We can then generalize Chebyshev’s inequality to Martingales,

Theorem 9.17 (Kolmogorov’s inequality).

Let $(X_k : k \in \mathbb{Z}_+)$ be a martingale in $L_2$. For $N \in \mathbb{Z}_+$,

$\mathbb{P}\Big\{\max_{k \leq N} (X_k - \mathbb{E}X_k)^2 > t\Big\} \leq \frac{\mathrm{Var}[X_N]}{t}.$ (184)
Proof.

The sequence $(X_k - \mathbb{E}X_k)$ is an $L_2$ martingale, so $(X_k - \mathbb{E}X_k)^2$ is a submartingale by Proposition 9.16. The statement follows from Doob's maximal inequality. ∎

Remark. Define $\Delta_k = X_k - X_{k-1}$. Then

$\mathrm{Var}[X_N] = \sum_{k=1}^{N} \mathbb{E}[\Delta_k^2].$ (185)
Proposition 9.18 (Exponential maximal inequality).

Let $(X_k : k \in \mathbb{Z}_+)$ be a bounded martingale. For $N \in \mathbb{Z}_+$,

$\mathbb{P}\Big\{\max_{k \leq N} X_k > t\Big\} \leq \inf_{\theta > 0} \exp(-\theta t + \xi_{X_N}(\theta)),$ (186)

where $\xi_X(\theta) = \log \mathbb{E} e^{\theta X}$ is the cgf of $X$.

Theorem 9.19 (Azuma-Hoeffding: uniform version).

Consider a martingale $(X_k : k \in \mathbb{Z}_+)$ whose difference sequence $\Delta_k := X_k - X_{k-1}$ is bounded: $|\Delta_k| \leq a_k$. Let the variance proxy be $\nu_N := \sum_{k=1}^{N} a_k^2$. Then

$\mathbb{P}\Big\{\max_{k \leq N} |X_k - X_0| > t\Big\} \leq 2 \exp(-t^2 / 2\nu_N).$ (187)
Proof.

Hoeffding's lemma: if $Y$ is a real random variable with $\mathbb{E}Y = 0$ and $|Y| \leq a$, then its cgf satisfies $\xi_Y(\theta) \leq \frac{1}{2} \theta^2 a^2$. As $X_N - X_0 = \sum_{k=1}^{N} \Delta_k$,

$m_{X_N - X_0}(\theta) = \mathbb{E}[e^{\theta \Delta_N} \cdots e^{\theta \Delta_1}]$
$= \mathbb{E}[\mathbb{E}[e^{\theta \Delta_N} \mid \mathcal{F}_{N-1}]\, e^{\theta \Delta_{N-1}} \cdots e^{\theta \Delta_1}]$
$\leq e^{\theta^2 a_N^2 / 2}\, \mathbb{E}[e^{\theta \Delta_{N-1}} \cdots e^{\theta \Delta_1}]$
$\leq \cdots \leq e^{\theta^2 \nu_N / 2}.$ (188)

Therefore,

$\xi_{X_N - X_0}(\theta) \leq \frac{1}{2} \theta^2 \nu_N.$ (189)

Using this bound with $\theta = t / \nu_N$ in the exponential maximal inequality proves the theorem. ∎
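A simulation sketch of the Azuma–Hoeffding bound (ours, not from the notes) for the $\pm 1$ random walk, where $a_k = 1$ and $\nu_N = N$: compare the empirical probability that the path ever leaves $[-t, t]$ with the bound $2 e^{-t^2 / 2N}$.

```python
import numpy as np

rng = np.random.default_rng(9)
trials, N, t = 200_000, 400, 50.0
signs = rng.integers(0, 2, size=(trials, N), dtype=np.int8) * 2 - 1
walk = np.cumsum(signs, axis=1, dtype=np.int32)   # martingale with |Delta_k| = 1

empirical = np.mean(np.max(np.abs(walk), axis=1) > t)
azuma = 2.0 * np.exp(-t**2 / (2.0 * N))           # nu_N = N since a_k = 1
print(empirical, azuma)   # empirical (~0.02) sits below the bound (~0.088)
```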

Theorem 9.20 (Ville’s inequality).

Let $(X_k : k \in \mathbb{Z}_+)$ be a positive supermartingale. Then

$\mathbb{P}\Big\{\sup_{k \in \mathbb{Z}_+} X_k > t\Big\} \leq \frac{\mathbb{E}X_0}{t}.$ (190)
Wenda Chu 储闻达, PhD Student