Probability Theory
Probability: Study of regular patterns that arise from random phenomena
Statistics: The inverse problem of making inferences from observations of patterns in random phenomena.
1 Preliminary: Measure Theory
Measures are defined to characterize the "mass" of a set, which is crucial in defining the Lebesgue integral, the probability measure, etc. To define measures, we first need to assume some structure on the collection of sets, namely a $\sigma$-algebra.
1.1 $\sigma$-Algebra
Definition 1.1 ($\sigma$-algebra).
A $\sigma$-algebra on $\Omega$ is a family $\mathcal{F}$ of subsets of $\Omega$ such that:
• $\Omega \in \mathcal{F}$.
• If $A \in \mathcal{F}$, then $A^{c} \in \mathcal{F}$.
• Let $I$ be a countable set. If $A_i \in \mathcal{F}$ for $i \in I$, then $\bigcup_{i \in I} A_i \in \mathcal{F}$ and $\bigcap_{i \in I} A_i \in \mathcal{F}$.
Definition 1.2 (Borel $\sigma$-algebra).
The Borel $\sigma$-algebra is the $\sigma$-algebra generated by the open sets of $E$:
$\mathcal{B}(E) = \sigma(\{U \subseteq E : U \text{ open}\})$, (1)
where $E$ is a topological space.
1.2 Measures
Definition 1.3 (Measure).
A measure $\mu$ on a space $\Omega$ with $\sigma$-algebra $\mathcal{F}$ is a map $\mu : \mathcal{F} \to [0, \infty]$ such that
• $\mu(\emptyset) = 0$;
• countable additivity: $\mu\big(\bigcup_{i} A_i\big) = \sum_{i} \mu(A_i)$ for disjoint sets $A_i \in \mathcal{F}$.
The pair $(\Omega, \mathcal{F})$ is then called a measurable space.
Definition 1.4 (Probability measure).
A probability measure $\mathbb{P}$ on $(\Omega, \mathcal{F})$ is a specific measure with $\mathbb{P}(\Omega) = 1$.
Other measures:
• Dirac measure: $\delta_x(A) = 1$ if $x \in A$, and $0$ otherwise. (2)
• Counting measure: $\mu(A) = |A|$, the number of points in $A$. (3)
1.3 Some related definitions
Finite measure: if $\mu(\Omega) < \infty$, then $\mu$ is finite.
$\sigma$-finite measure: if we can cover $\Omega$ by countably many measurable sets, each with finite measure, we call $\mu$ a $\sigma$-finite measure.
A straightforward claim: finite measures are $\sigma$-finite. "Length" on $\mathbb{R}$ is a $\sigma$-finite measure but not finite.
Negligible: if a measurable set $A$ has $\mu(A) = 0$, then $A$ is a negligible (null) set for $\mu$.
Almost everywhere: if a property holds outside a negligible set, it holds $\mu$-almost everywhere, denoted $\mu$-a.e.
1.4 Lebesgue measure
Borel measure: a measure defined on the Borel $\sigma$-algebra $\mathcal{B}(E)$.
Lebesgue measures are defined on the Borel $\sigma$-algebra. For example, on $\mathbb{R}$:
$\lambda(A) = \inf\Big\{\sum_{i} (b_i - a_i) : A \subseteq \bigcup_{i} (a_i, b_i)\Big\}$. (4)
Intuition: find the minimal union of intervals that covers $A$. (Note: a union of intervals in $\mathbb{R}$ can always be written as a countable union of intervals.)
• It is a measure.
• $\lambda([a, b]) = b - a$, for $a \le b$.
• Translation invariance: $\lambda(A + x) = \lambda(A)$ for all $x \in \mathbb{R}$.
The Lebesgue measure is the only Borel measure that satisfies the above properties.
Another example of a Borel measure: $\mu_F((a, b]) = F(b) - F(a)$, where $F$ is a cumulative distribution function, which is finite and monotone.
1.5 Integration
Riemann integral. Let $f : [a, b] \to \mathbb{R}$. We say $f$ is Riemann integrable if the following limit exists:
$\int_a^b f(x)\,dx = \lim_{n \to \infty} \sum_{i=1}^{n} f(x_i^\ast)\,\frac{b-a}{n}$, (5)
where $x_i^\ast$ is an arbitrary point in the $i$-th interval.
However, the Riemann integral may not exist. For example, take $f(x) = 1$ for rational $x$ in $[0, 1]$ and $f(x) = 0$ otherwise. No matter how small an interval is, it contains both rational and irrational numbers, so the Riemann sums depend on the choice of evaluation points and do not converge.
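This failure can be sketched numerically (a minimal demo, not part of the notes; rationality is tracked symbolically, with `Fraction` tags modeling rational sample points and floats standing in for irrational ones): the tagged Riemann sums take different values depending on the tags, so no limit exists.

```python
from fractions import Fraction
import math

def dirichlet(x):
    # Dirichlet function: 1 on rationals, 0 on irrationals. Rationality is
    # tracked symbolically: Fraction tags model rational points, plain
    # floats stand in for irrational points.
    return 1.0 if isinstance(x, Fraction) else 0.0

def riemann_sum(f, n, tag):
    # Tagged Riemann sum of f over [0, 1] with n equal subintervals;
    # tag(i, n) picks the evaluation point in [i/n, (i+1)/n].
    return sum(f(tag(i, n)) / n for i in range(n))

# Rational tags (left endpoints i/n) give sum ~1, while tags shifted by an
# irrational offset give sum 0 -- the Riemann sums cannot converge.
rational_tags = riemann_sum(dirichlet, 1000, lambda i, n: Fraction(i, n))
irrational_tags = riemann_sum(dirichlet, 1000, lambda i, n: i / n + math.sqrt(2) / (2 * n))
```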
Definition 1.5 (Decreasing rearrangement).
For a positive function $f$, the decreasing rearrangement is defined as follows:
$m_f(t) = \mu\{x : f(x) > t\}$. (6)
$m_f$ is a decreasing function, as the measure is monotonic.
Remark. Given the definition of $m_f$, our expectation is to define integration by:
$\int f \, d\mu = \int_0^\infty m_f(t)\, dt$. (7)
The intuition of this definition is: instead of counting the area along the $x$-axis, we count the area along the $y$-axis.
For this integration to work, we need the sets $\{f > t\}$ to be measurable. The definition of a measurable function below solves this problem.
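The "count the area by the $y$-axis" idea can be checked numerically on a toy example (a hypothetical step function on four points under the uniform measure; this is a demo, not part of the notes): integrating $f$ directly agrees with integrating $t \mapsto \mu\{f > t\}$.

```python
# Layer-cake check: for a positive function f on a finite set with the
# uniform measure (counting measure / n), the integral of f equals the
# integral of t -> mu({f > t}) over [0, infinity). Values are hypothetical.
values = [0.5, 1.5, 1.5, 3.0]           # f on four sample points
n = len(values)

lhs = sum(values) / n                    # integral of f against the uniform measure

# mu({f > t}) is a step function; integrate it exactly by summing over the
# intervals between consecutive distinct values of f.
thresholds = [0.0] + sorted(set(values))
rhs = 0.0
for lo, hi in zip(thresholds, thresholds[1:]):
    measure = sum(1 for v in values if v > lo) / n   # mu({f > t}) for t in (lo, hi)
    rhs += measure * (hi - lo)
```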
Definition 1.6 ((Borel) measurable function).
Let $(\Omega, \mathcal{F})$ be a measurable space. A function $f : \Omega \to \mathbb{R}$ is Borel-measurable if
$\{x : f(x) > t\} \in \mathcal{F}$ for all $t \in \mathbb{R}$. (8)
Proposition 1.7.
A function $f$ is Borel measurable if and only if
$f^{-1}(B) \in \mathcal{F}$ (9)
for all Borel $B$.
Properties of measurable functions:
• If $f$ is measurable, $cf$ is measurable (for $c \in \mathbb{R}$),
• $f + g$ is measurable,
• $fg$, $f/g$ are measurable,
• $\max(f, g)$ is measurable,
• $|f|$ is measurable.
• If $f_n$ are measurable functions, $\sup_n f_n$, $\inf_n f_n$, $\limsup_n f_n$, $\liminf_n f_n$ are measurable.
Given the rigorous definition above, we can finally define Lebesgue integration. We first define it on positive functions and use it as a tool to define Lebesgue-integrable function in general.
Definition 1.8 (Lebesgue integral of a positive function).
Let $\mu$ be a Borel measure. Let $f$ be a positive, measurable function. Then we define the Lebesgue integral of $f$ with respect to $\mu$ by
$\int f \, d\mu = \int_0^\infty \mu\{f > t\}\, dt$. (10)
Definition 1.9 (Lebesgue integrable function).
A finite-valued measurable function $f$ is integrable with respect to a Borel measure $\mu$ if
$\int |f| \, d\mu < \infty$. (11)
Its Lebesgue integral is defined as
$\int f \, d\mu = \int f_+ \, d\mu - \int f_- \, d\mu$, (12)
notated $\int_\Omega f \, d\mu$, where $f_\pm$ are the positive and negative parts of $f$.
Proposition 1.10.
If a bounded function $f$ over a bounded domain is Riemann integrable, then $f$ is Lebesgue integrable with respect to the Lebesgue measure $\lambda$, and the two integrals agree:
$\int_a^b f(x)\,dx = \int_{[a,b]} f \, d\lambda$. (13)
Proposition 1.11 (Monotone convergence).
Let $f_n$ be positive, measurable functions, increasing in $n$, with $f_n \uparrow f$ pointwise. Then
$\lim_{n \to \infty} \int f_n \, d\mu = \int f \, d\mu$. (14)
2 Mathematical Foundation of Probability
Kolmogorov developed an axiomatic foundation for the theory of probability in 1933.
Definition 2.1 (Probability space, ANK 1933).
A triple $(\Omega, \mathcal{F}, \mathbb{P})$, where
• $\Omega$ is a set called the sample space.
• $\mathcal{F}$ is a $\sigma$-algebra of subsets of $\Omega$.
• $\mathbb{P}$ is a probability measure on $(\Omega, \mathcal{F})$ assigning probabilities to events.
Notes:
• Points $\omega \in \Omega$ are called sample points, the outcomes of a probability experiment in simple cases.
• $\mathcal{F}$ identifies the sets of sample points that are assigned probabilities. Sets $A \in \mathcal{F}$ are called events.
• When $\omega \in A$ occurs, we say the event $A$ occurs.
• If $\Omega$ is finite or countable, then we can take $\mathcal{F} = 2^{\Omega}$. However, for $\Omega = \mathbb{R}$, the power set is too big for a reasonable measurable $\sigma$-algebra, so we take the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R})$.
States in $\Omega$ give a complete system description, but this is often too detailed and inaccessible. Instead, we are more interested in observables, or system statistics, which are called random variables in probability theory.
Definition 2.2 (Real random variable).
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. A real random variable is a measurable function
$X : \Omega \to \mathbb{R}$. (15)
Remarks.
$X$ is a fixed function, so where does the "randomness" come from? In fact, all the randomness comes from the sample point $\omega$, which is random in the probability space.
The distribution of probability over the events induces a distribution of probability over the values in $\mathbb{R}$ of the random variable.
The term "measurable function" used above is defined as:
Definition 2.3 (Measurable function).
A measurable function $X$ has the property that $X^{-1}(B) \in \mathcal{F}$ for all Borel $B$.
Based on this property, we can compute the probability that $X$ takes a value in $B$:
$\mathbb{P}(X \in B) = \mathbb{P}(X^{-1}(B)) = \mathbb{P}\{\omega : X(\omega) \in B\}$. (16)
Definition 2.4 (Law of a real random variable).
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. Let $X$ be a real random variable.
The law (or distribution) of $X$ is the Borel probability measure $\mu_X$ on $\mathbb{R}$ such that
$\mu_X(B) = \mathbb{P}(X \in B)$ for all Borel $B$. (17)
Notes. The probability that $X \in B$ can be expressed as
$\mathbb{P}(X \in B) = \mu_X(B) = \int_B \mu_X(dx)$. (18)
Definition 2.5 (Distribution function).
Let $X$ be a real random variable on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ with law $\mu_X$. Define
$F_X(x) = \mathbb{P}(X \le x) = \mu_X((-\infty, x])$. (19)
This is called the cumulative distribution function (CDF) of $X$.
Note: it is important that the definition uses $\le$ and not $<$.
Properties of the CDF:
• Monotonicity: $F_X$ is nondecreasing.
• Asymptotics: $F_X(x) \to 0$ as $x \to -\infty$ and $F_X(x) \to 1$ as $x \to +\infty$.
• Right-continuity.
• Law: $\mu_X((a, b]) = F_X(b) - F_X(a)$.
Continuous random variables: the law has a density with respect to the Lebesgue measure $\lambda$:
$\mu_X(B) = \int_B f_X(x)\, dx$, (20)
where $f_X$ is the probability density function. In particular, $F_X$ is absolutely continuous.
Singular continuous: $F_X$ is continuous but has no density function.
3 Expectation
3.1 Definitions
Definition 3.1 (Expectation).
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $X$ be a real random variable.
The expectation is the real number given by the integral:
$\mathbb{E}[X] = \int_\Omega X(\omega)\, \mathbb{P}(d\omega)$. (21)
More formally, for positive $X$ we define it using the Lebesgue integral:
$\mathbb{E}[X] = \int_0^\infty \mathbb{P}(X > t)\, dt$. (22)
For real $X$ (not necessarily positive), if $\mathbb{E}|X| < \infty$, we define $\mathbb{E}[X] = \mathbb{E}[X_+] - \mathbb{E}[X_-]$.
Definition 3.2 (Integrable random variable).
For a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, we define
$L^1(\Omega, \mathcal{F}, \mathbb{P}) = \{X : \mathbb{E}|X| < \infty\}$. (23)
Note: not all random variables are integrable, e.g., Cauchy random variables.
3.2 Change of variables
How do we compute the expectation using the law? We have the change-of-variables proposition.
Proposition 3.3 (Law of the unconscious statistician, LOTUS).
Let $X$ be a real random variable with law $\mu_X$. Let $h$ be a function that is $\mu_X$-integrable. Then
$\mathbb{E}[h(X)] = \int_{\mathbb{R}} h(x)\, \mu_X(dx)$. (24)
Example. If $X$ is continuous with a density $f_X$,
$\mathbb{E}[h(X)] = \int_{\mathbb{R}} h(x)\, f_X(x)\, dx$. (25)
Expectation is linear: $\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]$.
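The change of variables can be illustrated on a finite probability space (a minimal sketch; the sample space, measure, and test function below are hypothetical): integrating $h(X)$ over $\Omega$ agrees with integrating $h$ against the law of $X$ on $\mathbb{R}$.

```python
# LOTUS on a finite probability space: E[h(X)] computed on Omega equals
# the integral of h against the law mu_X. All values are hypothetical.
omega = ["rain", "sun", "snow"]
P = {"rain": 0.5, "sun": 0.3, "snow": 0.2}    # probability measure on Omega
X = {"rain": 0.0, "sun": 1.0, "snow": 1.0}    # random variable Omega -> R
h = lambda x: x ** 2 + 1.0                     # mu_X-integrable test function

# Omega-side integral: sum over sample points.
lhs = sum(h(X[w]) * P[w] for w in omega)

# Law of X: push the measure forward onto the values of X.
law = {}
for w in omega:
    law[X[w]] = law.get(X[w], 0.0) + P[w]

# Law-side integral: integrate h against mu_X.
rhs = sum(h(x) * p for x, p in law.items())
```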
3.3 Convexity
Recall the (informal) definition of convex functions on $\mathbb{R}^d$:
$\varphi(\lambda x + (1 - \lambda) y) \le \lambda \varphi(x) + (1 - \lambda) \varphi(y)$ for $\lambda \in [0, 1]$. (26)
Proposition 3.4 (Subgradient property).
Suppose $\varphi$ is a convex function. Then for all $x_0$, there exists a subgradient vector $g(x_0)$ such that
$\varphi(x) \ge \varphi(x_0) + \langle g(x_0), x - x_0 \rangle$ for all $x$. (27)
Theorem 3.5 (Jensen's inequality).
Let $\varphi$ be convex. Assume $\varphi$ is bounded below. Let $X$ be an integrable random variable. Then
$\varphi(\mathbb{E}[X]) \le \mathbb{E}[\varphi(X)]$. (28)
Proof sketch.
Use the subgradient property with $x_0 = \mathbb{E}[X]$ and take expectations; the inequality pops out directly. ∎
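Jensen's inequality can be checked on a small discrete distribution (a minimal numerical sketch; the distribution and the convex function $\varphi(x) = x^2$ are hypothetical choices):

```python
# Jensen check: phi(E[X]) <= E[phi(X)] for convex phi, on a hypothetical
# discrete distribution {value: probability}.
dist = {1.0: 0.2, 2.0: 0.5, 5.0: 0.3}
phi = lambda x: x * x                 # convex test function

EX = sum(x * p for x, p in dist.items())           # E[X] = 2.7
E_phiX = sum(phi(x) * p for x, p in dist.items())  # E[X^2]
```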
4 Moments and Tails
How do we collect information about the distribution of a random variable?
4.1 Definitions
Definition 4.1 (Moment).
Let $X$ be a real random variable with law $\mu_X$. A moment is an integral of a real test function $h$ ($\mu_X$-integrable) against the law:
$\mathbb{E}[h(X)] = \int h \, d\mu_X$. (29)
Examples:
• Indicator $h = \mathbb{1}_B$ for Borel $B$, which gives $\mathbb{E}[h(X)] = \mathbb{P}(X \in B)$.
• First-order moment, $h(x) = x$. Then $\mathbb{E}[h(X)] = \mathbb{E}[X]$.
• $p$-th polynomial moment: $h(x) = x^p$ for $p \in \mathbb{N}$. $\mathbb{E}[h(X)] = \mathbb{E}[X^p]$.
• Exponential moment: let $h(x) = e^{\lambda x}$. $\mathbb{E}[h(X)] = \mathbb{E}[e^{\lambda X}]$.
Definition 4.2 (Tails).
Let $X$ be a real random variable, $t \ge 0$. The (right) tail event is $\{X \ge t\}$. We often refer to the tail probability $\mathbb{P}(X \ge t)$.
4.2 Moments and Tail Bounds
Several theorems on tail bounds build the connection between moments and tails.
Theorem 4.3 (Markov's inequality).
Let $X$ be a positive random variable and $t > 0$. Then
$\mathbb{P}(X \ge t) \le \frac{\mathbb{E}[X]}{t}$. (30)
Remark. For positive, increasing $\psi$, we have
$\mathbb{P}(X \ge t) \le \frac{\mathbb{E}[\psi(X)]}{\psi(t)}$. (31)
We can use $\psi$ to introduce higher-order information about $X$ into the tail bound. $\psi(x) = e^{\lambda x}$ sometimes yields a tighter tail bound of exponential decay, i.e., the Chernoff bound.
In other words, polynomial moments control tail probabilities.
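Both forms of the inequality can be checked exactly on an explicit distribution (a minimal sketch; the distribution is hypothetical). Note how the $\psi(x) = x^2$ version beats plain Markov far in the tail:

```python
# Markov's inequality on a hypothetical positive distribution:
# P(X >= t) <= E[X]/t, and the psi-version P(X >= t) <= E[X^2]/t^2.
dist = {0.0: 0.4, 1.0: 0.3, 3.0: 0.2, 10.0: 0.1}   # value -> probability
EX = sum(x * p for x, p in dist.items())            # E[X]   = 1.9
E_X2 = sum(x * x * p for x, p in dist.items())      # E[X^2] = 12.1

def tail(t):
    # exact tail probability P(X >= t)
    return sum(p for x, p in dist.items() if x >= t)
```

At $t = 10$ the second-moment bound $E[X^2]/t^2 = 0.121$ is tighter than Markov's $E[X]/t = 0.19$, illustrating how higher moments sharpen tail control.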
Theorem 4.4 (Integration by parts).
Let $X$ be a positive real random variable and $\psi$ be an increasing and continuously differentiable function. Then,
$\mathbb{E}[\psi(X)] = \psi(0) + \int_0^\infty \psi'(t)\, \mathbb{P}(X \ge t)\, dt$. (32)
Remark.
• When $\psi(t) = t^p$ for some $p \ge 1$, then
$\mathbb{E}[X^p] = \int_0^\infty p\, t^{p-1}\, \mathbb{P}(X \ge t)\, dt$. (33)
This means tail probabilities control moments.
Assume $X$ is a real random variable with an exponential tail bound, $\mathbb{P}(|X| \ge t) \le C e^{-ct}$ for some constants $C, c > 0$. Then we can bound the $p$-th moment for $p \ge 1$:
$\mathbb{E}|X|^p = \int_0^\infty p\, t^{p-1}\, \mathbb{P}(|X| \ge t)\, dt \le C \int_0^\infty p\, t^{p-1} e^{-ct}\, dt = \frac{C\, p\, \Gamma(p)}{c^p} = \frac{C\, \Gamma(p+1)}{c^p}$. (34–37)
5 $L^p$ Spaces
Definition 5.1 ($L^p$ space).
For $1 \le p < \infty$, the $L^p$ space is
$L^p(\Omega, \mathcal{F}, \mathbb{P}) = \{X : \mathbb{E}|X|^p < \infty\}$. (38)
Roughly, these are random variables with tail decay rate at least $t^{-p}$. These random variables are called "$p$-integrable random variables".
Remarks. $L^p$ is a linear space.
Definition 5.2 (Homogeneous $p$-th moment).
For $X \in L^p$, we define the homogeneous $p$-th moment of $X$ by
$\|X\|_p = (\mathbb{E}|X|^p)^{1/p}$. (39)
Sometimes denoted as $\|X\|_{L^p}$.
Theorem 5.3 (Lyapunov inequality).
For $1 \le p \le q < \infty$, real random variables satisfy
$\|X\|_p \le \|X\|_q$. (40)
Therefore,
$L^q \subseteq L^p$ for $p \le q$. (41)
Proof sketch.
Use Jensen's inequality for the convex function $x \mapsto x^{q/p}$. ∎
Warning. For sequence spaces, the inclusions reverse: $\ell^p \subseteq \ell^q$ when $p \le q$. For Lebesgue spaces $L^p(\mathbb{R})$, neither inclusion holds in general.
5.1 Pseudonorm
Question: Is $\|\cdot\|_p$ a norm? We would need to prove the triangle inequality.
Theorem 5.4 (Hölder's inequality).
Let $X, Y$ be real random variables with $X \in L^p$, $Y \in L^q$, where $1/p + 1/q = 1$. Then
$\mathbb{E}|XY| \le \|X\|_p \|Y\|_q$. (42)
Proof.
We use a lemma called Young's inequality: if $1/p + 1/q = 1$, then $xy \le \frac{x^p}{p} + \frac{y^q}{q}$ for $x, y \ge 0$. We consider $x = |X| / \|X\|_p$ and $y = |Y| / \|Y\|_q$, and take expectations. ∎
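Young's inequality itself is easy to sanity-check numerically (a minimal grid check, not a proof; the exponents and grid are arbitrary choices):

```python
# Young's inequality: x*y <= x**p/p + y**q/q for x, y >= 0 and conjugate
# exponents 1/p + 1/q = 1, checked over a grid.
p, q = 3.0, 1.5                          # conjugate: 1/3 + 2/3 = 1
grid = [i * 0.1 for i in range(51)]      # x, y in [0, 5]

# largest violation of the inequality over the grid (should be <= 0,
# with equality when y = x**(p-1), e.g. x = y = 1)
max_gap = max(x * y - (x ** p / p + y ** q / q) for x in grid for y in grid)
```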
Theorem 5.5 (Minkowski).
Assume $1 \le p < \infty$. Let $X, Y \in L^p$. Then
$\|X + Y\|_p \le \|X\|_p + \|Y\|_p$. (43)
Proof.
$\mathbb{E}|X + Y|^p \le \mathbb{E}\big[|X|\,|X+Y|^{p-1}\big] + \mathbb{E}\big[|Y|\,|X+Y|^{p-1}\big]$. (44)
Apply Hölder with $q = p/(p-1)$:
$\mathbb{E}|X + Y|^p \le (\|X\|_p + \|Y\|_p)\,\big(\mathbb{E}|X+Y|^p\big)^{1 - 1/p}$. (45)
After dividing by $\big(\mathbb{E}|X+Y|^p\big)^{1-1/p}$, the triangle inequality pops out. ∎
Proposition 5.6 ($L^p$ pseudonorm).
For random variables in $L^p$, $\|\cdot\|_p$ is a pseudonorm:
• Positive semi-definite: $\|X\|_p \ge 0$ and $\|0\|_p = 0$.
• Positive homogeneity: $\|aX\|_p = |a|\, \|X\|_p$.
• Triangle inequality: $\|X + Y\|_p \le \|X\|_p + \|Y\|_p$.
• Almost positive: $\|X\|_p = 0$ implies $X = 0$ $\mathbb{P}$-almost surely.
We can thus define a pseudometric on $L^p$:
$d(X, Y) = \|X - Y\|_p$. (46)
5.2 Convergence in $L^p$
Definition 5.7.
A sequence $(X_n)$ in $L^p$ converges in $L^p$ if there is a real random variable $X$ such that
$\|X_n - X\|_p \to 0$ as $n \to \infty$. (47)
Notes. Convergence in $L^q$ implies convergence in $L^p$ for $p \le q$ (by Lyapunov).
Question: When does a sequence in $L^p$ converge?
Definition 5.8 (Cauchy sequence).
A sequence $(X_n)$ in $L^p$ is called Cauchy if
$\|X_m - X_n\|_p \to 0$ as $m, n \to \infty$. (48)
It is trivial that any converging sequence is Cauchy, but the other direction is not trivial.
Theorem 5.9 ($L^p$ space is complete).
Every Cauchy sequence in $L^p$ converges to a random variable. Moreover, it converges to a random variable in $L^p$.
Notes. An example of a non-complete space: $\mathbb{Q}$. A sequence of rational numbers may converge to an irrational number.
Remark. Limits in $L^p$ are not necessarily unique! They are only required to be equal $\mathbb{P}$-almost surely.
5.3 $L^2$ Space
Next, we restrict our discussion to the $L^2$ space, which is special for supporting orthogonality, variance, covariance, and orthogonal projection.
$L^2(\Omega, \mathcal{F}, \mathbb{P}) = \{X : \mathbb{E}[X^2] < \infty\}$. (49)
Theorem 5.10 (Cauchy–Schwarz).
If $X, Y \in L^2$, then $XY \in L^1$, and
$\mathbb{E}|XY| \le \|X\|_2\, \|Y\|_2$. (50)
This is a special case of Hölder's theorem when $p = q = 2$. Nevertheless, here is another, simpler proof.
Proof.
Let $t \in \mathbb{R}$.
$0 \le \mathbb{E}\big[(|X| + t|Y|)^2\big] = \mathbb{E}[X^2] + 2t\,\mathbb{E}|XY| + t^2\,\mathbb{E}[Y^2]$, (51)
which is a quadratic function of $t$. Hence the discriminant is nonpositive: $(\mathbb{E}|XY|)^2 \le \mathbb{E}[X^2]\,\mathbb{E}[Y^2]$. ∎
Definition 5.11 ($L^2$ (pseudo) inner product).
For random variables $X, Y \in L^2$, we define
$\langle X, Y \rangle = \mathbb{E}[XY]$. (52)
Remarks.
• This is well-defined by Cauchy–Schwarz.
• $\langle X, X \rangle = \|X\|_2^2$.
Definition 5.12 (Orthogonality).
If $\langle X, Y \rangle = 0$, then we say $X$ and $Y$ are orthogonal random variables, and we write $X \perp Y$.
Warning: $X \perp Y$ does NOT imply that $X, Y$ are "independent".
Definition 5.13 (Covariance).
Let $X, Y \in L^2$. The covariance of $X, Y$ is
$\operatorname{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}X)(Y - \mathbb{E}Y)] = \langle X - \mathbb{E}X,\, Y - \mathbb{E}Y \rangle$. (53)
If $\operatorname{Cov}(X, Y) = 0$, we say $X, Y$ are uncorrelated, not necessarily independent!
Moreover, we define the variance of a random variable by $\operatorname{Var}(X) = \operatorname{Cov}(X, X)$.
The correlation of $X, Y$ is defined by
$\rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}$. (54)
Proposition 5.14 (Pythagorean).
If $X, Y \in L^2$, then $X \perp Y$ implies
$\|X + Y\|_2^2 = \|X\|_2^2 + \|Y\|_2^2$. (55)
Theorem 5.15 (Orthogonal projection).
Let $V$ be a complete linear subspace of $L^2$.
For $X \in L^2$, there is a random variable $\Pi X \in V$ such that
• Primal: $\|X - \Pi X\|_2 = \inf_{Y \in V} \|X - Y\|_2$.
• Dual: $\langle X - \Pi X, Y \rangle = 0$ for all $Y \in V$.
These two statements are essentially the same. We call $\Pi X$ a version of the orthogonal projection of $X$ onto $V$.
Remark. Such projections are not unique: two versions can differ on a null set, i.e., they agree almost surely.
Proof.
Consider a minimizing sequence $(Y_i)$ in $V$: $\|X - Y_i\|_2$ is decreasing and converges to $d = \inf_{Y \in V} \|X - Y\|_2$.
Step 1: Show that $(Y_i)$ has a limit in $V$, which is a candidate for the projection. Check that the sequence is Cauchy; by the parallelogram law,
$\|Y_i - Y_j\|_2^2 = 2\|X - Y_i\|_2^2 + 2\|X - Y_j\|_2^2 - 4\big\|X - \tfrac{Y_i + Y_j}{2}\big\|_2^2 \le 2\|X - Y_i\|_2^2 + 2\|X - Y_j\|_2^2 - 4d^2 \to 0$. (56)
So $(Y_i)$ is Cauchy in $V$. Since $V$ is complete, $Y_i \to \Pi X$ for some $\Pi X \in V$. $\Pi X$ is a candidate for the orthogonal projection of $X$ onto $V$.
Step 2: Check that $\Pi X$ is indeed an orthogonal projection. By Minkowski, for each $i$,
$\|X - \Pi X\|_2 \le \|X - Y_i\|_2 + \|Y_i - \Pi X\|_2$, (57)
which converges to $d$. Moreover, $\|X - \Pi X\|_2 \ge d$, so $\|X - \Pi X\|_2 = d$.
Step 3: Dual characterization. For $Y \in V$, let $Y_t = \Pi X + tY$, which lies in $V$ since $V$ is a linear subspace. Then
$d^2 \le \|X - Y_t\|_2^2 = d^2 - 2t\,\langle X - \Pi X, Y \rangle + t^2 \|Y\|_2^2$. (58)
Since this holds for all $t \in \mathbb{R}$, we must have $\langle X - \Pi X, Y \rangle = 0$. ∎
6 Independence
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. How do we update our knowledge about an event $A$ when another event $B$ (with $\mathbb{P}(B) > 0$) has occurred?
$\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}$. (59)
We define the $\sigma$-algebra $\mathcal{F}_B = \{A \cap B : A \in \mathcal{F}\}$.
Then $(B, \mathcal{F}_B, \mathbb{P}(\cdot \mid B))$ is a probability measure space.
Definition 6.1 (Independence of events).
Suppose knowledge that $B$ occurs does not change the probability of $A$:
$\mathbb{P}(A \mid B) = \mathbb{P}(A)$. (60)
Then we call the events $A$ and $B$ independent. It is equivalent to saying
$\mathbb{P}(A \cap B) = \mathbb{P}(A)\,\mathbb{P}(B)$. (61)
6.1 Independence for random variables
What about random variables being independent? Random variables can take any value in their range, so we must require a family of events to be independent.
Definition 6.2 (Independence of random variables).
Let $X_1, \dots, X_n$ be real random variables. We say that $X_1, \dots, X_n$ are independent when
$\mathbb{P}(X_1 \in B_1, \dots, X_n \in B_n) = \prod_{i=1}^n \mathbb{P}(X_i \in B_i)$ (62)
for all Borel sets $B_1, \dots, B_n$.
In particular, taking $B_i = (-\infty, t_i]$ for all $t_i \in \mathbb{R}$, it is common to see the definition below:
$\mathbb{P}(X_1 \le t_1, \dots, X_n \le t_n) = \prod_{i=1}^n \mathbb{P}(X_i \le t_i)$. (63)
Fact: these two definitions are equivalent.
This is also equivalent to
$\mathbb{E}\Big[\prod_i h_i(X_i)\Big] = \prod_i \mathbb{E}[h_i(X_i)]$ for all bounded measurable $h_i$. (64)
Moreover, it is equivalent to $\mu_{(X_1, \dots, X_n)} = \mu_{X_1} \otimes \cdots \otimes \mu_{X_n}$ on $\mathbb{R}^n$.
Proposition 6.3.
Let $X, Y$ be independent real random variables in $L^1$. Then $XY \in L^1$ and
$\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$. (65)
Proof.
$\mathbb{E}[XY] = \int xy \, \mu_{(X,Y)}(dx, dy) = \int xy\, \mu_X(dx)\, \mu_Y(dy) = \mathbb{E}[X]\,\mathbb{E}[Y]$, (66–68)
using the product structure of the joint law.
∎
Question: Can we generate a collection of random variables with specified marginals? For a countable sequence, the answer is yes!
Theorem 6.4 (Kolmogorov extension).
Let $(\mu_i)_{i \in \mathbb{N}}$ be a collection of Borel probability measures on $\mathbb{R}$. Define
$\Omega = \mathbb{R}^{\mathbb{N}} = \{\omega = (\omega_1, \omega_2, \dots)\}$. (69)
Consider the coordinate random variables $X_i(\omega) = \omega_i$.
We can define the product $\sigma$-algebra:
$\mathcal{F} = \sigma(X_1, X_2, \dots)$. (70)
There exists a product probability measure $\mathbb{P} = \bigotimes_i \mu_i$ on $(\Omega, \mathcal{F})$ such that
$\mathbb{P}(X_1 \in B_1, \dots, X_n \in B_n) = \prod_{i=1}^n \mu_i(B_i)$ for all $n$ and Borel $B_i$. (71)
6.2 Independence for $\sigma$-algebras
Note that $\sigma$-algebras carry information. We can also define the independence of $\sigma$-algebras.
Definition 6.5 (Independence of $\sigma$-algebras).
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. Let $\mathcal{G}, \mathcal{H}$ be sub-$\sigma$-algebras of $\mathcal{F}$.
We say that $\mathcal{G}, \mathcal{H}$ are independent if for all $A \in \mathcal{G}$ and $B \in \mathcal{H}$,
$\mathbb{P}(A \cap B) = \mathbb{P}(A)\,\mathbb{P}(B)$. (72)
For events, define $\sigma(A) = \{\emptyset, A, A^c, \Omega\}$. We can check that $A$ and $B$ are independent events if and only if $\sigma(A)$ and $\sigma(B)$ are independent $\sigma$-algebras.
For real random variables, define $\sigma(X) = \{X^{-1}(B) : B \text{ Borel}\}$. Then real random variables $X, Y$ are independent if and only if $\sigma(X)$, $\sigma(Y)$ are independent $\sigma$-algebras.
7 Law of Large Numbers
Definition 7.1 (Stochastic process).
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. A stochastic (or random) process is a family $(X_t)_{t \in T}$ of real random variables, indexed by an abstract set $T$.
• Discrete time: $T = \mathbb{N}$.
• Continuous time: $T = [0, \infty)$.
• Space: $T = \mathbb{R}^d$, "random fields".
Key models covered in this note: Independent sum; discrete-time martingales.
Definition 7.2 (Independent sum).
Let $(X_i)$ be an independent sequence of real random variables.
The partial sums
$S_n = \sum_{i=1}^n X_i$ (73)
compose a discrete-time random process, called an independent sum process.
Examples.
• Repeated independent experiments.
• Random walk.
• Renewal processes.
Definition 7.3 (Running average process).
Let $(X_i)$ be independent real random variables. $(\bar{X}_n)$ is called the running average process:
$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i = \frac{S_n}{n}$. (74–75)
7.1 Weak Law of Large Numbers
Definition 7.4 (Convergence in probability).
A sequence of real random variables $(Y_n)$ converges in probability to a random variable $Y$ if, for every $\epsilon > 0$,
$\mathbb{P}(|Y_n - Y| \ge \epsilon) \to 0$ as $n \to \infty$. (76)
Theorem 7.5 (Chebyshev's weak LLN).
Let $X \in L^2$ be a real random variable and $(X_i)$ be iid copies of $X$. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. Then $\bar{X}_n \to \mathbb{E}[X]$ in probability:
$\mathbb{P}(|\bar{X}_n - \mathbb{E}[X]| \ge \epsilon) \to 0$ for every $\epsilon > 0$. (77)
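The weak LLN can be illustrated by simulation (a minimal Monte Carlo sketch; the sample size, tolerance, and number of trials are arbitrary demo choices):

```python
import random

# Weak LLN illustration: running averages of iid Bernoulli(1/2) variables
# concentrate around the mean 1/2. Parameters are arbitrary demo choices.
random.seed(0)

def running_average(n):
    # average of n iid Bernoulli(1/2) samples
    return sum(random.random() < 0.5 for _ in range(n)) / n

n, eps, trials = 2000, 0.05, 200
# empirical frequency of the "bad" event |X_bar - 1/2| >= eps
freq_bad = sum(abs(running_average(n) - 0.5) >= eps for _ in range(trials)) / trials
# Chebyshev predicts freq_bad <= Var(X) / (n * eps^2) = 0.05 here.
```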
7.2 Strong Law of Large Numbers
Definition 7.6 (A.s. convergence).
Consider a sequence $(Y_n)$ of real random variables. We say that $Y_n \to Y$ almost surely if
$\mathbb{P}\{\omega : Y_n(\omega) \to Y(\omega) \text{ as } n \to \infty\} = 1$. (78)
Equivalently,
$\mathbb{P}\Big(\bigcap_{\epsilon > 0} \bigcup_{N} \bigcap_{n \ge N} \{|Y_n - Y| < \epsilon\}\Big) = 1$. (79)
Remark. Almost sure convergence implies convergence in probability. (80)
Theorem 7.7 (Kolmogorov strong LLN).
Let $X \in L^1$ be a real random variable. Let $(X_i)$ be iid copies of $X$. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. Then $\bar{X}_n \to \mathbb{E}[X]$ a.s. That is,
$\mathbb{P}\big(\bar{X}_n \to \mathbb{E}[X]\big) = 1$. (81)
Theorem 7.8 (Cantelli strong LLN).
Assume $X \in L^4$. Then the previous theorem has a simpler proof.
Proof.
WLOG, assume $\mathbb{E}[X] = 0$.
Lemma 1 (Borel–Cantelli).
Let $(A_n)$ be events. If $\sum_n \mathbb{P}(A_n) < \infty$, then $\mathbb{P}(A_n \text{ infinitely often}) = 0$. (82)
Proof of the lemma.
$\mathbb{P}\Big(\bigcap_N \bigcup_{n \ge N} A_n\Big) = \lim_{N \to \infty} \mathbb{P}\Big(\bigcup_{n \ge N} A_n\Big) \le \lim_{N \to \infty} \sum_{n \ge N} \mathbb{P}(A_n) = 0$. (83–85)
∎
Lemma 2 (Cantelli tail bound).
Assume $\mathbb{E}[X] = 0$, $X \in L^4$. Then for every $\epsilon > 0$,
$\mathbb{P}(|\bar{X}_n| \ge \epsilon) \le \frac{C\, \mathbb{E}[X^4]}{n^2 \epsilon^4}$ for a constant $C$. (86)
Cont. Define the event $A_n = \{|\bar{X}_n| \ge \epsilon\}$. Then
$\sum_n \mathbb{P}(A_n) \le \sum_n \frac{C\, \mathbb{E}[X^4]}{n^2 \epsilon^4} < \infty$. (87)
Then, using the Borel–Cantelli lemma,
$\mathbb{P}(|\bar{X}_n| \ge \epsilon \text{ infinitely often}) = 0$, so $\bar{X}_n \to 0$ a.s. (88)
∎
7.3 Concentration
The law of large numbers gives an asymptotic analysis of the limit of $\bar{X}_n$. However, we also want a nonasymptotic analysis for fixed $n$.
Proposition 7.9 (Chebyshev).
Let $X \in L^2$ and $t > 0$. Then
$\mathbb{P}\big(|\bar{X}_n - \mathbb{E}[X]| \ge t\big) \le \frac{\operatorname{Var}(X)}{n t^2}$. (89)
• Chebyshev's inequality gives weak control on the tail bound (polynomial decay in $t$), but this is the best we can get by assuming $X \in L^2$ only.
Definition 7.10 (Moment generating function (mgf) and cumulant generating function (cgf)).
Let $X$ be a real random variable. The moment generating function is defined as
$m_X(\lambda) = \mathbb{E}[e^{\lambda X}]$, (90)
and the cumulant generating function as
$\xi_X(\lambda) = \log m_X(\lambda)$. (91)
Proposition 7.11 (Laplace transform method).
Let $X$ be a real random variable. For all $t$,
$\mathbb{P}(X \ge t) \le \inf_{\lambda > 0} e^{-\lambda t}\, m_X(\lambda)$, (92)
$\mathbb{P}(X \ge t) \le \exp\Big(-\sup_{\lambda > 0} \big(\lambda t - \xi_X(\lambda)\big)\Big)$. (93)
Proof sketch.
Simply apply Markov's inequality to $\psi(x) = e^{\lambda x}$. ∎
Example (normal distribution). Recall that for $X \sim \mathcal{N}(0, \sigma^2)$, we have $\xi_X(\lambda) = \lambda^2 \sigma^2 / 2$. Then
$\mathbb{P}(X \ge t) \le \exp\Big(-\frac{t^2}{2\sigma^2}\Big)$. (94)
Proposition 7.12 (Additivity of cgf).
Let $S_n = \sum_{i=1}^n X_i$ for independent random variables $X_i$. Then
$\xi_{S_n}(\lambda) = \sum_{i=1}^n \xi_{X_i}(\lambda)$. (95)
Thanks to the additivity of the cgf, we can easily derive a tail bound for independent sums.
Theorem 7.13.
Let $S_n = \sum_{i=1}^n X_i$ for independent $X_i$. Then
$\mathbb{P}(S_n \ge t) \le \exp\Big(-\sup_{\lambda > 0}\Big(\lambda t - \sum_{i=1}^n \xi_{X_i}(\lambda)\Big)\Big)$. (96–97)
Example (binomial). Let $S_n \sim \operatorname{Bin}(n, p)$, so $S_n = \sum_{i=1}^n X_i$ where $X_i \sim \operatorname{Bern}(p)$ iid.
$\xi_{X_i}(\lambda) = \log\big(1 - p + p e^{\lambda}\big)$. (98)
So we get (the Chernoff bound)
$\mathbb{P}(S_n \ge nq) \le e^{-n\, \mathrm{KL}(q \| p)}$ for $q > p$, where $\mathrm{KL}(q \| p) = q \log\frac{q}{p} + (1-q)\log\frac{1-q}{1-p}$. (99)
Theorem 7.14 (Hoeffding's theorem).
Let $S_n = \sum_{i=1}^n X_i$ for independent bounded random variables $X_i \in [a_i, b_i]$. Let $\sigma^2 = \frac{1}{4}\sum_{i=1}^n (b_i - a_i)^2$. Then
$\mathbb{P}(S_n - \mathbb{E}[S_n] \ge t) \le \exp\Big(-\frac{t^2}{2\sigma^2}\Big)$. (100)
Proof.
We need a cgf bound for $X_i$. We find that $\xi_{X_i - \mathbb{E}X_i}(\lambda) \le \frac{\lambda^2 (b_i - a_i)^2}{8}$. This is because
$\xi_{X_i}''(\lambda) = \operatorname{Var}_\lambda(X_i) \le \frac{(b_i - a_i)^2}{4}$, (101–102)
where $\operatorname{Var}_\lambda$ denotes the variance under the exponentially tilted distribution, which is supported on $[a_i, b_i]$. Plugging this into the Laplace transform method proves the theorem. ∎
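Hoeffding's bound can be compared with an exact tail on a concrete case (a minimal sketch; the choice of $n = 50$ fair coin flips is arbitrary). For $X_i \in [0, 1]$ we get $\sigma^2 = n/4$, so the bound reads $\mathbb{P}(S_n - n/2 \ge t) \le e^{-2t^2/n}$.

```python
import math

# Hoeffding's bound vs. the exact Binomial(n, 1/2) tail. Each X_i lies in
# [0, 1], so sigma^2 = n/4 and P(S_n - n/2 >= t) <= exp(-2 t^2 / n).
n, p = 50, 0.5

def exact_tail(t):
    # P(S_n >= n*p + t), computed exactly from the binomial pmf.
    k0 = math.ceil(n * p + t)
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(k0, n + 1))

def hoeffding_bound(t):
    return math.exp(-2 * t * t / n)
```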
7.4 Weak convergence
Limit theorems in probability tell us when a sequence of random variables converges to a limiting random variable. In this section, we consider describing when the distributions of the random variables converge to a limiting distribution.
What does it mean for the distributions of random variables to converge?
Idea: moments carry information about distributions:
$\mu \mapsto \int h \, d\mu$. (103)
If two measures $\mu, \nu$ are close, then many moments are similar: $\int h \, d\mu \approx \int h \, d\nu$ for "many" test functions $h$.
Definition 7.15.
A function $h : \mathbb{R} \to \mathbb{R}$ is bounded Lipschitz if
$\|h\|_{BL} = \|h\|_\infty + \|h\|_{\mathrm{Lip}} < \infty$, (104)
where
$\|h\|_\infty = \sup_x |h(x)|$ (105)
and
$\|h\|_{\mathrm{Lip}} = \sup_{x \ne y} \frac{|h(x) - h(y)|}{|x - y|}$. (106)
Remark. $\|\cdot\|_{BL}$ is a norm.
Proposition 7.16 (BL functions separate measures).
Let $\mu, \nu$ be Borel probability measures on $\mathbb{R}$.
Then $\mu = \nu$ if and only if $\int h \, d\mu = \int h \, d\nu$ for all bounded Lipschitz $h$.
Proof.
The forward direction is trivial.
For the reverse direction, assume $\mu \ne \nu$. Then the cdfs differ: there exists $x_0$ such that
$F_\mu(x_0) \ne F_\nu(x_0)$. (107)
Let us consider a sequence of BL functions $h_k$ approximating the indicator $\mathbb{1}_{(-\infty, x_0]}$: $h_k = 1$ on $(-\infty, x_0]$, $h_k = 0$ on $[x_0 + 1/k, \infty)$, linear in between. (108)
Using the bounded convergence theorem (or dominated convergence), $\int h_k \, d\mu \to F_\mu(x_0)$ and $\int h_k \, d\nu \to F_\nu(x_0)$. Therefore, there exists $k$ such that $\int h_k \, d\mu \ne \int h_k \, d\nu$. ∎
Definition 7.17 (Bounded Lipschitz metric).
Let $\mu, \nu$ be Borel probability measures on $\mathbb{R}$. Define
$d_{BL}(\mu, \nu) = \sup\Big\{\Big|\int h \, d\mu - \int h \, d\nu\Big| : \|h\|_{BL} \le 1\Big\}$. (109)
Note: $d_{BL}$ is a metric on the probability measures on $\mathbb{R}$, i.e.,
• $d_{BL}(\mu, \nu) = 0$ if and only if $\mu = \nu$.
• $d_{BL}(\mu, \nu) = d_{BL}(\nu, \mu)$.
• $d_{BL}(\mu, \nu) \le d_{BL}(\mu, \rho) + d_{BL}(\rho, \nu)$.
Definition 7.18 (Weak convergence).
Let $(\mu_n)$ be Borel probability measures on $\mathbb{R}$, and $\mu$ a Borel probability measure. We say $\mu_n$ converges weakly to $\mu$ if
$d_{BL}(\mu_n, \mu) \to 0$. (110)
Let $(X_n)$ be real random variables and $X$ a real random variable. Then we say $X_n$ converges weakly to $X$ if $\mu_{X_n}$ converges weakly to $\mu_X$.
Theorem 7.19.
Let $(\mu_n)$, $\mu$ be Borel probability measures on $\mathbb{R}$. The following are equivalent:
• $\mu_n$ converges weakly to $\mu$.
• $\int h \, d\mu_n \to \int h \, d\mu$ for all bounded Lipschitz $h$.
• $\int h \, d\mu_n \to \int h \, d\mu$ for all bounded continuous $h$.
• The distribution function $F_{\mu_n}(x)$ converges to $F_\mu(x)$ for all $x$ where $F_\mu$ is continuous.
• The characteristic functions converge pointwise, $\chi_{\mu_n}(\theta) \to \chi_\mu(\theta)$, where
$\chi_\mu(\theta) = \int e^{i\theta x}\, \mu(dx)$. (111)
Proposition 7.20.
$X_n \to X$ almost surely implies that $X_n$ converges weakly to $X$.
Definition 7.21 (Integral probability metric).
Let $\mathcal{H}$ be a collection of functions from $\mathbb{R}$ to $\mathbb{R}$. Define the IPM related to $\mathcal{H}$ by
$d_{\mathcal{H}}(\mu, \nu) = \sup_{h \in \mathcal{H}} \Big|\int h \, d\mu - \int h \, d\nu\Big|$ (112)
for all Borel probability measures $\mu, \nu$ on $\mathbb{R}$.
Remark.
• Different $\mathcal{H}$ give different notions of distance.
• If $\mathcal{H} \subseteq \mathcal{H}'$, then $d_{\mathcal{H}} \le d_{\mathcal{H}'}$.
• $d_{\mathcal{H}}$ is a metric if and only if $\mathcal{H}$ separates measures.
Examples:
• Kolmogorov: $\mathcal{H} = \{\mathbb{1}_{(-\infty, x]} : x \in \mathbb{R}\}$, which induces uniform convergence of cdfs.
• Total variation: $\mathcal{H} = \{h : \|h\|_\infty \le 1\}$.
• Kantorovich (Wasserstein-1) distance:
$\mathcal{H} = \{h : \|h\|_{\mathrm{Lip}} \le 1\}$. (113)
Convergence in $W_1$ is equivalent to weak convergence together with convergence of first moments.
7.5 Central Limit Theorem
Let $(X_i)$ be iid copies of $X \in L^2$; then $\bar{X}_n - \mathbb{E}[X]$ converges almost surely to $0$.
Idea: rescale to make the variance nontrivial. We define standardized sums
$Z_n = \frac{\sqrt{n}\,(\bar{X}_n - \mathbb{E}[X])}{\sqrt{\operatorname{Var}(X)}} = \frac{S_n - n\,\mathbb{E}[X]}{\sqrt{n \operatorname{Var}(X)}}$. (114)
It can be checked that $\mathbb{E}[Z_n] = 0$ and $\operatorname{Var}(Z_n) = 1$.
Theorem 7.22 (Central limit theorem).
Let $(X_i)$ be iid copies of $X \in L^2$. Define
$Z_n = \frac{S_n - n\,\mathbb{E}[X]}{\sqrt{n \operatorname{Var}(X)}}$. (115)
Then $Z_n$ converges weakly to $\mathcal{N}(0, 1)$ as $n \to \infty$.
Remark.
• Historically, the central limit theorem has been used to support asymptotic confidence intervals for $\mathbb{E}[X]$.
• Equivalent statements:
$d_{BL}(\mu_{Z_n}, \mathcal{N}(0, 1)) \to 0$. (116)
Also equivalent to
$\mathbb{P}(Z_n \le x) \to \Phi(x)$ for all $x \in \mathbb{R}$, (117)
which is often called convergence in distribution.
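The CLT can be sketched by simulation (a minimal Monte Carlo demo; the choice of Uniform(0, 1) summands and the sample sizes are arbitrary): the empirical cdf of the standardized sums is compared with $\Phi$ at a few points.

```python
import math
import random

# CLT sketch: standardized sums of iid Uniform(0, 1) variables compared
# with the standard normal cdf. Parameters are arbitrary demo choices.
random.seed(1)
n, trials = 100, 5000
mean, sd = 0.5, math.sqrt(1.0 / 12.0)     # moments of Uniform(0, 1)

def z_sample():
    # one standardized sum Z_n = (S_n - n*mean) / sqrt(n * Var)
    s = sum(random.random() for _ in range(n))
    return (s - n * mean) / (math.sqrt(n) * sd)

zs = [z_sample() for _ in range(trials)]

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# gaps between the empirical cdf of Z_n and Phi at a few test points
gaps = [abs(sum(z <= x for z in zs) / trials - std_normal_cdf(x))
        for x in (-1.0, 0.0, 1.0)]
```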
7.6 Nonasymptotic counterpart
Theorem 7.23 (Berry–Esseen).
Let $(X_i)$ be iid copies of $X \in L^3$. Define
$\rho = \frac{\mathbb{E}|X - \mathbb{E}X|^3}{\operatorname{Var}(X)^{3/2}}$. (118)
Let $Z_n$ be the standardized sum. Then the Kolmogorov distance satisfies
$\sup_x \big|\mathbb{P}(Z_n \le x) - \Phi(x)\big| \le \frac{C \rho}{\sqrt{n}}$ for an absolute constant $C$. (119)
Proof of a weaker version by the Lindeberg exchange principle.
The idea is to show that for smooth functions $h$,
$|\mathbb{E}[h(Z_n)] - \mathbb{E}[h(Z)]| \to 0$ for $Z \sim \mathcal{N}(0, 1)$. (120)
Lemma 3.
Let $U, V$ be random variables with $\mathbb{E}[U] = \mathbb{E}[V]$ and $\mathbb{E}[U^2] = \mathbb{E}[V^2]$. Let $h$ be smooth with $\|h'''\|_\infty < \infty$. Then
$|\mathbb{E}[h(U)] - \mathbb{E}[h(V)]| \le \frac{\|h'''\|_\infty}{6}\big(\mathbb{E}|U|^3 + \mathbb{E}|V|^3\big)$. (121)
Proof of lemma.
Taylor expansion:
$h(u) = h(0) + h'(0)\,u + \tfrac{1}{2} h''(0)\,u^2 + R(u)$, with $|R(u)| \le \frac{\|h'''\|_\infty}{6} |u|^3$. (122–123)
The expansion is the same for $U$ and $V$ through the second-order terms, since the first two moments match, so
$|\mathbb{E}[h(U)] - \mathbb{E}[h(V)]| = |\mathbb{E}[R(U)] - \mathbb{E}[R(V)]| \le \frac{\|h'''\|_\infty}{6}\big(\mathbb{E}|U|^3 + \mathbb{E}|V|^3\big)$. (124)
∎
Let and be independent with and . For , such that , then
(125) |
This is called the Lindeberg exchange theorem. To see this, let
(126) |
Then since and ,
(127) |
Let , in be independent random variables that are already standardized, i.e., and . Let . Let be smooth. Define
(128) |
. Then applying the Lindeberg exchange principle,
(129) |
Let
(130) |
(131) |
∎
Issues:
-
•
Where is the normal random variable?
Solution: Choose , then we can compute .
-
•
What is ?
Solution: For each with , we can smooth to get a function such that is bounded and is bounded, which implies
(132)
8 Conditional Expectation
If two random variables are dependent, then by observing one, we can update our knowledge about the probabilities of events involving the other. In particular, we can update our "best guess" for the expectation.
Fix a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Let $X \in L^2$.
Recall that $L^2 = \{X : \mathbb{E}[X^2] < \infty\}$.
8.1 Conditional expectation in least squares
Consider a subspace of $L^2$:
$V = \{\text{constant random variables}\}$. (133)
$V$ is a complete subspace of $L^2$, so there is always a projection from $X \in L^2$ onto $V$, i.e.,
$\Pi X = \arg\min_{c \in \mathbb{R}} \mathbb{E}[(X - c)^2] = \mathbb{E}[X]$. (134)
Now suppose we observe a random variable $Y$ and want to update our best guess of $X$. We first try an affine approximation:
$\hat{X} = a + bY$, (135)
where $a, b \in \mathbb{R}$:
$\min_{a, b} \mathbb{E}[(X - a - bY)^2]$. (136)
We can compute the optimal $a, b$ given the means, variances, and $\operatorname{Cov}(X, Y)$.
Since constants are affine functions of $Y$, the error is no worse than $\operatorname{Var}(X)$.
However, this affine function of $Y$ is suboptimal (consider, e.g., a purely nonlinear dependence such as $X = Y^2$). In general, we have to consider a nonlinear function of $Y$.
Consider $V_Y = \{h(Y) : h \text{ measurable},\ \mathbb{E}[h(Y)^2] < \infty\}$:
$\min_{h} \mathbb{E}\big[(X - h(Y))^2\big]$. (137)
Recall that
Lemma 4.
$Z$ is $\sigma(Y)$-measurable if and only if $Z = h(Y)$ for some measurable $h$.
As a result, $V_Y$ is a complete subspace of $L^2$, since
$V_Y = L^2(\Omega, \sigma(Y), \mathbb{P})$. (138)
Therefore, there exists an orthogonal projection of $X$ onto $V_Y$. Moreover, it is unique almost surely.
We denote this projection by $\mathbb{E}[X \mid Y]$.
8.2 Conditioning on a $\sigma$-algebra
Consider conditioning on a $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$. For each event $A \in \mathcal{G}$, we can decide whether $\omega \in A$ or not.
Definition 8.1 (Conditional expectation in $L^2$).
Fix $(\Omega, \mathcal{F}, \mathbb{P})$. Let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$.
For $X \in L^2(\Omega, \mathcal{F}, \mathbb{P})$,
$\min\big\{\mathbb{E}[(X - Z)^2] : Z \in L^2(\Omega, \mathcal{G}, \mathbb{P})\big\}$. (139)
A solution to this least-squares problem is the conditional expectation of $X$ given $\mathcal{G}$, denoted $\mathbb{E}[X \mid \mathcal{G}]$.
Dual characterization: $\mathbb{E}\big[(X - \mathbb{E}[X \mid \mathcal{G}])\, Z\big] = 0$ for all $Z \in L^2(\Omega, \mathcal{G}, \mathbb{P})$.
Remarks.
• $\mathbb{E}[X \mid \mathcal{G}]$ is a random variable on the sample space. Its value is determined once we know, for all $A \in \mathcal{G}$, whether $A$ occurred.
• For random variables, $\mathbb{E}[X \mid Y] = \mathbb{E}[X \mid \sigma(Y)]$.
• For an event $A$, condition on $\sigma(A) = \{\emptyset, A, A^c, \Omega\}$. The conditional expectation must be constant on $A$ and on $A^c$.
The optimal value on $A$ is $\mathbb{E}[X \mathbb{1}_A] / \mathbb{P}(A)$, and similarly on $A^c$.
8.3 Characteristic properties of conditional expectation
Let $(A_i)$ be a disjoint cover of $\Omega$, $A_i \in \mathcal{F}$. Let $\mathcal{G} = \sigma(A_1, A_2, \dots)$.
Measurability: $\mathbb{E}[X \mid \mathcal{G}]$ is constant on each minimal event $A_i$. Therefore, $\mathbb{E}[X \mid \mathcal{G}]$ is $\mathcal{G}$-measurable, i.e., $\{\mathbb{E}[X \mid \mathcal{G}] \in B\} \in \mathcal{G}$ for all Borel $B$.
Consistency: the average of $\mathbb{E}[X \mid \mathcal{G}]$ on events in $\mathcal{G}$ equals the average of $X$ on the event:
$\mathbb{E}\big[\mathbb{E}[X \mid \mathcal{G}]\, \mathbb{1}_A\big] = \mathbb{E}[X \mathbb{1}_A]$ for all $A \in \mathcal{G}$. (140)
This is also often called the coarse-graining property.
Proof.
Apply the dual characterization with $Z = \mathbb{1}_A$:
$\mathbb{E}\big[(X - \mathbb{E}[X \mid \mathcal{G}])\, \mathbb{1}_A\big] = 0$. (141)
Since $\mathbb{1}_A$ is measurable in $\mathcal{G}$, the statement is proved. ∎
8.4 Conditional expectation in $L^1$
So far we have only defined conditional expectation in $L^2$. How can we make it work if $X$ only belongs to $L^1$?
Definition 8.2 (Kolmogorov, 1933).
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. Let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$. Let $X \in L^1$. A real random variable $Z$ is a version of $\mathbb{E}[X \mid \mathcal{G}]$ if
• $Z$ is integrable: $\mathbb{E}|Z| < \infty$.
• $Z$ is $\mathcal{G}$-measurable.
• Consistency: $\mathbb{E}[Z \mathbb{1}_A] = \mathbb{E}[X \mathbb{1}_A]$ for all $A \in \mathcal{G}$.
Theorem 8.3 (Fundamental theorem of conditional expectation).
There exists a version of the conditional expectation $\mathbb{E}[X \mid \mathcal{G}]$. If $Z'$ is another version, then $Z' = \mathbb{E}[X \mid \mathcal{G}]$ almost surely.
Proof sketch.
Existence. WLOG assume $X \ge 0$. Let $X_n = \min(X, n)$.
$X_n$ is positive, bounded, measurable, and mono-increasingly converges to $X$ pointwise.
$X_n$ bounded implies $X_n \in L^2$, so we can compute its conditional expectation $Z_n = \mathbb{E}[X_n \mid \mathcal{G}]$. We can check that $(Z_n)$ is increasing. Define
$Z = \lim_{n \to \infty} Z_n$. (142)
By the monotone convergence theorem, $\mathbb{E}[Z \mathbb{1}_A] = \lim_n \mathbb{E}[Z_n \mathbb{1}_A] = \lim_n \mathbb{E}[X_n \mathbb{1}_A] = \mathbb{E}[X \mathbb{1}_A]$ for all $A \in \mathcal{G}$.
Moreover, $Z$ is $\mathcal{G}$-measurable, since the $Z_n$ are measurable.
Uniqueness. Assume $Z, Z'$ are both versions of the conditional expectation. By consistency, $\mathbb{E}[(Z - Z')\mathbb{1}_A] = 0$ for all $A \in \mathcal{G}$. For contradiction, assume $\mathbb{P}(Z > Z') > 0$.
Let $A_n = \{Z - Z' > 1/n\}$, which are $\mathcal{G}$-measurable events. $A_n$ mono-increasingly converges to $\{Z > Z'\}$. This means $\mathbb{P}(A_n)$ mono-increasingly converges to $\mathbb{P}(Z > Z') > 0$. Therefore, there exists $n$ such that $\mathbb{E}[(Z - Z')\mathbb{1}_{A_n}] \ge \mathbb{P}(A_n)/n > 0$, a contradiction. ∎
Proposition 8.4 (Conditional expectation is an expectation).
Conditional expectation satisfies the same properties as expectation:
• Unital and positive.
• Conditional expectation is linear.
• Monotonic: if $X \le Y$ a.s., then $\mathbb{E}[X \mid \mathcal{G}] \le \mathbb{E}[Y \mid \mathcal{G}]$ a.s.
• Jensen's inequality holds: $\varphi(\mathbb{E}[X \mid \mathcal{G}]) \le \mathbb{E}[\varphi(X) \mid \mathcal{G}]$ a.s. for convex $\varphi$.
8.5 Convergence theorems
Theorem 8.5 (Conditional monotone convergence).
Suppose $X_n \uparrow X$ a.s., where $X_n, X \in L^1$. Then
$\mathbb{E}[X_n \mid \mathcal{G}] \uparrow \mathbb{E}[X \mid \mathcal{G}]$ a.s. (143)
Proposition 8.6.
Some properties of conditional expectation:
• Expectation: if $\mathcal{G} = \{\emptyset, \Omega\}$, then $\mathbb{E}[X \mid \mathcal{G}] = \mathbb{E}[X]$.
• Full knowledge: if $X$ is $\mathcal{G}$-measurable, then $\mathbb{E}[X \mid \mathcal{G}] = X$ a.s.
• Independence: if $\sigma(X)$ and $\mathcal{G}$ are independent, then $\mathbb{E}[X \mid \mathcal{G}] = \mathbb{E}[X]$ a.s.
• Pull-through: if $Y$ is bounded and $\mathcal{G}$-measurable, then $\mathbb{E}[XY \mid \mathcal{G}] = Y\,\mathbb{E}[X \mid \mathcal{G}]$ a.s.
• Tower: if $\mathcal{G}_1 \subseteq \mathcal{G}_2$ are increasing $\sigma$-algebras, then
$\mathbb{E}\big[\mathbb{E}[X \mid \mathcal{G}_2] \mid \mathcal{G}_1\big] = \mathbb{E}[X \mid \mathcal{G}_1]$. (144)
Conditional integration:
$\mathbb{E}\big[\mathbb{E}[X \mid \mathcal{G}]\big] = \mathbb{E}[X]$. (145)
8.6 Gaussian and conditioning
Definition 8.7 (Multivariate normal random variable).
$X$ is a multivariate normal random vector if it is an affine function of a standard normal vector $Z = (Z_1, \dots, Z_k)$ with iid $Z_i \sim \mathcal{N}(0, 1)$:
$X = \mu + AZ$, (146)
where $\mu \in \mathbb{R}^d$ and $A \in \mathbb{R}^{d \times k}$.
Remark.
• $\operatorname{Cov}(X) = A A^{T}$, where $\operatorname{Cov}(X)_{ij} = \operatorname{Cov}(X_i, X_j)$.
• We denote $X \sim \mathcal{N}(\mu, \Sigma)$, as $(\mu, \Sigma)$ uniquely determine a multivariate normal random variable.
Definition 8.8 (Multivariate characteristic function).
Let $X$ be a random vector on $\mathbb{R}^d$. The characteristic function of $X$ is
$\chi_X(\theta) = \mathbb{E}\big[e^{i \langle \theta, X \rangle}\big]$. (147)
We can check that if $X \sim \mathcal{N}(\mu, \Sigma)$, then
$\chi_X(\theta) = \exp\big(i \langle \theta, \mu \rangle - \tfrac{1}{2} \langle \theta, \Sigma \theta \rangle\big)$. (148)
If $X, Y$ are independent random vectors on $\mathbb{R}^d$, then
$\chi_{X+Y}(\theta) = \chi_X(\theta)\, \chi_Y(\theta)$. (149)
It is easy to check that for $B \in \mathbb{R}^{m \times d}$,
$\chi_{BX}(\theta) = \chi_X(B^{T}\theta)$. (150)
Theorem 8.9.
Suppose $X, Y$ are random vectors taking values in $\mathbb{R}^d$. Then the laws of $X$ and $Y$ are the same if and only if their characteristic functions are the same.
Theorem 8.10 (Linear marginals).
A random vector $X$ on $\mathbb{R}^d$ is normal if and only if $\langle \theta, X \rangle$ is a real normal random variable for every $\theta \in \mathbb{R}^d$.
8.6.1 Gaussian conditioning
Let $(X, Y)$ be multivariate normal. Assume (WLOG, after centering) $\mathbb{E}[X] = 0$ and $\mathbb{E}[Y] = 0$.
We want to find the best approximation of $Y$ as a linear function of $X$:
$\min_{B} \mathbb{E}\|Y - BX\|^2$. (151)
We define $\Sigma_{XX} = \mathbb{E}[XX^T]$, $\Sigma_{YX} = \mathbb{E}[YX^T]$, and $\Sigma_{YY} = \mathbb{E}[YY^T]$.
Then
$\mathbb{E}\|Y - BX\|^2 = \operatorname{tr}(\Sigma_{YY}) - 2\operatorname{tr}(B\,\Sigma_{YX}^{T}) + \operatorname{tr}(B\,\Sigma_{XX} B^{T})$. (152)
By taking the derivative with respect to $B$,
$-2\Sigma_{YX} + 2B\Sigma_{XX} = 0 \implies B = \Sigma_{YX}\Sigma_{XX}^{-1}$. (153)
The best approximation is thus
$\hat{Y} = \Sigma_{YX}\Sigma_{XX}^{-1} X$. (154)
This means that
$\operatorname{Cov}(Y - \hat{Y}, X) = \Sigma_{YX} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XX} = 0$. (155)
Then $Y - \hat{Y}$ and $X$ are uncorrelated. Specifically, for normal distributions, uncorrelated means independent (which can be shown using characteristic functions, as normal random variables are determined by mean and covariance).
Theorem 8.11 (Gaussian conditional expectation).
Unlike for general random variables, linear functions give full information for conditioning:
$\mathbb{E}[Y \mid X] = \Sigma_{YX}\Sigma_{XX}^{-1} X$ (for centered $X, Y$). (156)
Proof.
$\mathbb{E}[Y \mid X] = \mathbb{E}[\hat{Y} + (Y - \hat{Y}) \mid X] = \hat{Y} + \mathbb{E}[Y - \hat{Y}] = \hat{Y}$ (independence and full information). ∎
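The two-dimensional case can be sketched by simulation (a minimal Monte Carlo demo; the correlation value and sample size are arbitrary choices). For centered jointly normal $(X, Y)$ with unit variances and correlation $\rho$, the formula gives $\mathbb{E}[Y \mid X] = \rho X$, and the residual $Y - \rho X$ is uncorrelated with $X$:

```python
import random

# Gaussian conditioning sketch in 2D: for centered jointly normal (X, Y)
# with unit variances and correlation rho, E[Y | X] = rho * X, and the
# residual Y - rho*X is uncorrelated with X. Checked by Monte Carlo.
random.seed(7)
rho = 0.6
samples = []
for _ in range(20000):
    x = random.gauss(0.0, 1.0)
    # Y = rho*X + sqrt(1 - rho^2) * Z is standard normal with Cov(X, Y) = rho
    y = rho * x + (1.0 - rho * rho) ** 0.5 * random.gauss(0.0, 1.0)
    samples.append((x, y))

m = len(samples)
cov_xy = sum(x * y for x, y in samples) / m           # should be near rho
resid_cov = sum(x * (y - rho * x) for x, y in samples) / m   # near 0
```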
9 Martingales
Recall that a -algebra captures information about the world. A bigger -algebra includes more knowledge.
A filtration is a mathematical model for accumulating information.
9.1 Filtration
Definition 9.1 (Filtration).
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. A (discrete-time) filtration is a sequence of increasing $\sigma$-algebras on $\Omega$:
$\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}$. (157)
Commonly, define $\mathcal{F}_0 = \{\emptyset, \Omega\}$.
Examples.
Let $(X_t)$ be real random variables, and $\mathcal{F}_t = \sigma(X_0, \dots, X_t)$. Then $(\mathcal{F}_t)$ is a filtration.
Definition 9.2 (Adapted process).
A sequence $(X_t)$ of real random variables is adapted to the filtration $(\mathcal{F}_t)$ if each $X_t$ is $\mathcal{F}_t$-measurable.
9.2 Defining Martingales
Definition 9.3 (Martingale (informal)).
A martingale is an adapted process that is indifferent about the future.
That is, for $s \le t$, $\mathbb{E}[X_t \mid \mathcal{F}_s] = X_s$.
Definition 9.4 (Martingale (formal)).
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space with a filtration $(\mathcal{F}_t)$. A sequence $(X_t)$ of real random variables is a martingale if
• Adapted: $(X_t)$ is adapted to the filtration.
• Integrable: $X_t \in L^1$ for each $t$.
• Status quo: $\mathbb{E}[X_{t+1} \mid \mathcal{F}_t] = X_t$ almost surely for each $t$.
We can relax the third property to $\mathbb{E}[X_{t+1} \mid \mathcal{F}_t] \le X_t$ a.s., in which case $(X_t)$ is called a supermartingale.
If $\mathbb{E}[X_{t+1} \mid \mathcal{F}_t] \ge X_t$ a.s., it is called a submartingale.
This seemingly awkward terminology comes from the super/sub-harmonic functions of Markov chains.
Examples. Independent sums of centered random variables, Random walks, Levy-Doob process.
9.3 Formal model for Gambling strategies
Definition 9.5 (Previsible process).
A sequence $(H_t)$ of real random variables is previsible with respect to the filtration $(\mathcal{F}_t)$ if $H_t$ is $\mathcal{F}_{t-1}$-measurable for all $t$.
When you play a game of chance, you bet before the game.
Definition 9.6 (Martingale transform).
Let $(X_t)$ be a martingale and $(H_t)$ be a previsible sequence. The martingale transform is the sequence
$(H \cdot X)_t = \sum_{s=1}^{t} H_s (X_s - X_{s-1})$. (158)
Remark. You bet $H_t$ units of capital on a fair game (martingale). Then you win $H_t(X_t - X_{t-1})$ on the $t$-th day. The total winnings after $t$ plays is $(H \cdot X)_t$. (159)
Proposition 9.7 (Martingale transform).
Let $(X_t)$ be a martingale and $(H_t)$ be a previsible process that is bounded. Then
$(H \cdot X)_t = \sum_{s=1}^{t} H_s (X_s - X_{s-1})$ (160)
is a null martingale, i.e., $\mathbb{E}[(H \cdot X)_t] = 0$.
Proof.
Pull-through property:
$\mathbb{E}\big[(H \cdot X)_{t+1} - (H \cdot X)_t \mid \mathcal{F}_t\big] = \mathbb{E}\big[H_{t+1}(X_{t+1} - X_t) \mid \mathcal{F}_t\big]$. (161)
Since $H_{t+1}$ is $\mathcal{F}_t$-measurable,
$= H_{t+1}\,\mathbb{E}[X_{t+1} - X_t \mid \mathcal{F}_t] = 0$. (162)
∎
We interpret this proposition: No sequence of bounded bets allows you to beat a fair game.
Another question is: can you quit the game when you are ahead to get advantages in a fair game?
Definition 9.8 (Stopping times).
A random variable $\tau$ taking values in $\{0, 1, 2, \dots\} \cup \{\infty\}$ is a stopping time if $\{\tau \le t\}$ is $\mathcal{F}_t$-measurable for each $t$.
Remark. The meaning of this definition is that after game $t$, you have enough information to decide whether the stopping criterion has been triggered.
Definition 9.9 (Stopped process).
Let $(X_t)$ be an adapted process and $\tau$ be a stopping time. The stopped process is the sequence
$X_t^{\tau} = X_{\tau \wedge t}$. (163)
The idea of this definition is that your winnings freeze after you stop gambling.
Proposition 9.10 (Stopped martingales).
If $(X_t)$ is a martingale and $\tau$ is a stopping time, then the stopped process $(X_{\tau \wedge t})$ is a martingale.
Remark. This means you cannot gain an advantage in a fair game by quitting strategically.
Proof.
Consider the previsible process $H_t = \mathbb{1}\{\tau \ge t\}$. Then
$(H \cdot X)_t = X_{\tau \wedge t} - X_0$. (164)
Since $(H_t)$ is previsible and bounded, the martingale transform is a martingale with initial value 0. This means
$X_{\tau \wedge t} = X_0 + (H \cdot X)_t$ (165)
is a sum of martingales, so
$(X_{\tau \wedge t})$ is a martingale. (166)
∎
Warning: it is not true in general that $\mathbb{E}[X_\tau] = \mathbb{E}[X_0]$, as $\tau$ is a random variable dependent on the outcomes of $(X_t)$. For example, for a simple random walk started at 0, if we stop when $X_t = 1$, then $\mathbb{E}[X_\tau] = 1 \ne 0 = \mathbb{E}[X_0]$.
However, with the following theorem, we circumvent this paradox by restricting the horizon of our game plays.
Theorem 9.11 (Optional stopping).
Let $(X_t)$ be a martingale and $\tau$ be a stopping time. If $\tau \le N$ for a fixed $N$ almost surely, then
$\mathbb{E}[X_\tau] = \mathbb{E}[X_0]$. (167)
Proof.
By the proposition on stopped martingales,
$\mathbb{E}[X_{\tau \wedge t}] = \mathbb{E}[X_0]$ for all $t$. (168)
Choose $t = N$, so $\tau \wedge N = \tau$ and
$\mathbb{E}[X_\tau] = \mathbb{E}[X_0]$. (169)
∎
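The bounded-horizon case can be verified exactly by enumerating every sample path (a minimal sketch; the walk, stopping rule, and horizon $N = 8$ are arbitrary choices):

```python
from itertools import product

# Optional stopping checked by exhaustive enumeration: a +-1 random walk
# (a martingale) is stopped at tau = first hit of +2, with the horizon
# capped at N = 8 so that tau <= N, as the theorem requires. Then
# E[X_tau] = E[X_0] = 0 exactly.
N = 8
total = 0.0
for steps in product((-1, 1), repeat=N):
    x = 0
    for s in steps:
        if x == 2:          # stopping rule triggered: freeze the walk
            break
        x += s
    total += x / 2 ** N      # each path has probability 2^-N
```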
9.4 Convergence of martingales
Martingales are favored by probability theorists because they converge to a stable equilibrium under weak assumptions.
Theorem 9.12 (Doob's convergence theorem).
Let $(X_t)$ be a martingale sequence that is uniformly bounded in $L^1$, i.e., there exists a positive $B$ such that $\mathbb{E}|X_t| \le B$ for all $t$.
Then almost surely, $X_\infty = \lim_{t \to \infty} X_t$ exists and is finite. That is, $(X_t)$ converges to $X_\infty$ almost surely.
Remark. It is not true in general that $X_t$ converges to $X_\infty$ in $L^1$.
Lemma 5 (Interval sandwich).
A nonrandom real-valued sequence $(x_t)$ fails to converge in $[-\infty, +\infty]$ if and only if
$\liminf_t x_t < a < b < \limsup_t x_t$ (170)
for some rational $a < b$.
We define "upcrossings" of a sequence over an interval.
Definition 9.13.
Let $(x_t)$ be a nonrandom real sequence. Fix $a < b$. The number of upcrossings before time $t$, $U_t[a, b]$, is the largest $k$ such that there exist times
$0 \le s_1 < t_1 < s_2 < t_2 < \cdots < s_k < t_k \le t$ (171)
with $x_{s_i} < a$ and $x_{t_i} > b$.
The total number of upcrossings:
$U_\infty[a, b] = \lim_{t \to \infty} U_t[a, b]$. (172)
Lemma 6.
The sandwich condition (170) implies $U_\infty[a, b] = \infty$. If $U_\infty[a, b]$ is finite for all rational pairs $a < b$, then $(x_t)$ converges in $[-\infty, +\infty]$.
9.4.1 Proof of Doob’s convergence
Idea: if a martingale crosses an interval infinitely many times, we can make money by betting on the upcrossings. This is impossible.
Specifically, we stop betting once $X_t$ exceeds $b$, and start betting when $X_t$ gets below $a$. We formalize this as follows.
Fix $a < b$. Let $H_1 = \mathbb{1}\{X_0 < a\}$.
Let $H_t = \mathbb{1}\{H_{t-1} = 1\}\,\mathbb{1}\{X_{t-1} \le b\} + \mathbb{1}\{H_{t-1} = 0\}\,\mathbb{1}\{X_{t-1} < a\}$.
We can check that $(H_t)$ is positive, bounded, and previsible. Then the martingale transform $(H \cdot X)_t$ is a null martingale.
Define $U_t[a, b]$ as the number of upcrossings of $[a, b]$ up to time $t$ by the sample path $(X_s(\omega))_{s \le t}$.
Note that
$(H \cdot X)_t \ge (b - a)\, U_t[a, b] - (X_t - a)_{-}$. (173)
Since $\mathbb{E}[(H \cdot X)_t] = 0$, we have the following proposition:
Proposition 9.14 (Snell upcrossing inequality).
Let $(X_t)$ be a martingale and $a < b$ be real numbers. Then
$\mathbb{E}\,U_t[a, b] \le \frac{\mathbb{E}(X_t - a)_{-}}{b - a} \le \frac{\mathbb{E}|X_t| + |a|}{b - a}$. (174)
As $\mathbb{E}|X_t| \le B$, then for any $t$,
$\mathbb{E}\,U_t[a, b] \le \frac{B + |a|}{b - a}$, so $\mathbb{E}\,U_\infty[a, b] < \infty$ and $U_\infty[a, b] < \infty$ a.s. (175)
Define the event
$A_{a,b} = \{U_\infty[a, b] = \infty\}$, which satisfies $\mathbb{P}(A_{a,b}) = 0$. (176)
Therefore, $\mathbb{P}\big(\bigcup_{a < b \text{ rational}} A_{a,b}\big) = 0$, because the set of rational pairs is countable. Thus, with probability 1, $(X_t)$ converges in $[-\infty, +\infty]$. Moreover, by Fatou's lemma,
$\mathbb{E}|X_\infty| \le \liminf_t \mathbb{E}|X_t| \le B < \infty$, so $X_\infty$ is finite a.s. (177)
9.5 Concentration inequalities for martingales
What is the probability that a martingale ever deviates much from its mean value?
Begin with submartingales: $\mathbb{E}[X_{t+1} \mid \mathcal{F}_t] \ge X_t$ a.s.
Theorem 9.15 (Doob's maximal inequality for submartingales).
Consider a positive submartingale $(X_t)$. For each $t$ and $u > 0$,
$\mathbb{P}\Big(\max_{s \le t} X_s \ge u\Big) \le \frac{\mathbb{E}[X_t]}{u}$. (178)
Moreover, if $X_t \to X_\infty$ in $L^1$, then it is also true for $t = \infty$.
Remark. Though this statement looks like Markov's inequality, we cannot derive it directly from Markov's inequality.
Proof.
We want to bound the probability that the martingale escapes a band. We define the stopping time
$\tau = \min\{s : X_s \ge u\}$. (179)
Therefore, the event
$\Big\{\max_{s \le t} X_s \ge u\Big\} = \{\tau \le t\}$. (180)
Then,
$\mathbb{E}[X_t] \ge \mathbb{E}[X_{\tau \wedge t}] \ge \mathbb{E}\big[X_{\tau \wedge t}\, \mathbb{1}\{\tau \le t\}\big] \ge u\, \mathbb{P}(\tau \le t)$. (181)
Therefore,
$\mathbb{P}\Big(\max_{s \le t} X_s \ge u\Big) \le \frac{\mathbb{E}[X_t]}{u}$. (182)
Moreover, $\max_{s \le t} X_s$ mono-increasingly converges to $\sup_s X_s$, and $\mathbb{E}[X_t] \to \mathbb{E}[X_\infty]$ by $L^1$ convergence. By the dominated convergence theorem the statement holds for $t = \infty$. ∎
Proposition 9.16 (Martingale: convex transformation).
Consider a martingale $(X_t)$. For any convex function $\varphi$, the sequence
$Y_t = \varphi(X_t)$ (183)
is a submartingale, provided that $\varphi(X_t) \in L^1$.
Proof.
By conditional Jensen's inequality. ∎
We can then generalize Chebyshev's inequality to martingales.
Theorem 9.17 (Kolmogorov's inequality).
Let $(X_t)$ be a martingale in $L^2$. For $u > 0$,
$\mathbb{P}\Big(\max_{s \le t} |X_s - X_0| \ge u\Big) \le \frac{\operatorname{Var}(X_t - X_0)}{u^2}$. (184)
Proof.
The sequence $(X_t - X_0)$ is a martingale, so $(X_t - X_0)^2$ is a submartingale, due to Proposition 9.16. Using Doob's maximal inequality, the statement follows. ∎
Remark. Define $S_n = \sum_{i \le n} X_i$ for independent centered $X_i \in L^2$. Then
$\mathbb{P}\Big(\max_{m \le n} |S_m| \ge u\Big) \le \frac{\sum_{i \le n} \operatorname{Var}(X_i)}{u^2}$. (185)
Proposition 9.18 (Exponential maximal inequality).
Let $(X_t)$ be a martingale with bounded increments. For $\lambda > 0$ and $u > 0$,
$\mathbb{P}\Big(\max_{s \le t} (X_s - X_0) \ge u\Big) \le \exp\Big(-\lambda u + \sum_{s \le t} \xi_s(\lambda)\Big)$, (186)
where $\xi_s$ is the (conditional) cgf of the increment $X_s - X_{s-1}$.
Theorem 9.19 (Azuma–Hoeffding: uniform version).
Consider a martingale $(X_t)$ whose difference sequence is bounded, $|X_s - X_{s-1}| \le c_s$. Let the variance proxy $\sigma^2 = \sum_{s \le t} c_s^2$. Then
$\mathbb{P}\Big(\max_{s \le t} (X_s - X_0) \ge u\Big) \le \exp\Big(-\frac{u^2}{2\sigma^2}\Big)$. (187)
Proof.
Hoeffding's lemma: if $Y$ is a real random variable with $\mathbb{E}[Y] = 0$, $|Y| \le c$, then its mgf satisfies $\mathbb{E}[e^{\lambda Y}] \le e^{\lambda^2 c^2 / 2}$. Applied conditionally to the increments,
$\mathbb{E}\big[e^{\lambda (X_s - X_{s-1})} \mid \mathcal{F}_{s-1}\big] \le e^{\lambda^2 c_s^2 / 2}$, (188)
therefore,
$\mathbb{P}\Big(\max_{s \le t} (X_s - X_0) \ge u\Big) \le \exp\Big(-\lambda u + \frac{\lambda^2 \sigma^2}{2}\Big)$. (189)
Using this result with $\lambda = u / \sigma^2$ in the exponential maximal inequality proves the theorem. ∎
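The bound can be checked against an exact tail in the simplest case (a minimal sketch; the $\pm 1$ random walk and the choice $n = 40$ are arbitrary). Here $c_s = 1$, so $\sigma^2 = n$ and $\mathbb{P}(X_n - X_0 \ge \lambda) \le e^{-\lambda^2 / 2n}$:

```python
import math

# Azuma-Hoeffding vs. the exact tail of a +-1 random walk: differences
# satisfy |X_s - X_{s-1}| <= 1, so sigma^2 = n and
# P(X_n - X_0 >= lam) <= exp(-lam^2 / (2 n)).
n = 40

def exact_tail(lam):
    # X_n = 2k - n, where k ~ Binomial(n, 1/2) counts the +1 steps.
    k0 = math.ceil((lam + n) / 2)
    return sum(math.comb(n, k) for k in range(k0, n + 1)) / 2 ** n

bounds_ok = all(exact_tail(lam) <= math.exp(-lam * lam / (2 * n))
                for lam in (6, 10, 14))
```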
Theorem 9.20 (Ville's inequality).
Let $(X_t)$ be a positive supermartingale. Then for $u > 0$,
$\mathbb{P}\Big(\sup_{t \ge 0} X_t \ge u\Big) \le \frac{\mathbb{E}[X_0]}{u}$. (190)