Analysis and Design of Algorithms

1 Probability Bounds

Markov Inequality

For a nonnegative random variable X , for any k > 0 ,

Pr(X ≥ k) ≤ 𝔼(X)/k. (1)

If we have the variance of a random variable, we can apply more detailed analyses:

Chebyshev Inequality

Let X be a random variable with 𝔼(X) = μ and Var(X) = σ². Then

Pr(|X − μ| ≥ t) ≤ σ²/t². (2)

The proof is straightforward: apply the Markov inequality to the nonnegative random variable (X − μ)².

Chernoff Bound

Let W_i be independent random variables and W = W_1 + ⋯ + W_n. The Chernoff bound takes advantage of higher-order moments of independent variables. For Bernoulli random variables with probability p,

Pr(W ≥ (1 + δ)𝔼[W]) ≤ (e^δ / (1 + δ)^{1+δ})^μ, where μ = 𝔼[W]. (3)

For X_i ∈ [0, 1] and X = ∑_i X_i with μ = 𝔼[X],

Pr(X ≥ (1 + δ)μ) ≤ exp(−δ²μ/3), Pr(X ≤ (1 − δ)μ) ≤ exp(−δ²μ/2), Pr[X ≥ R] ≤ exp(−R) for R ≥ 5μ. (4)

The proof is by considering the moment generating function e^{tW}, which encodes higher-order information about W.
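A minimal numerical sanity check (not part of the notes): compare the empirical tail probability of a sum of Bernoulli variables against the multiplicative Chernoff bound exp(−δ²μ/3). The parameters n, p, δ are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, delta = 1000, 0.1, 0.5
mu = n * p

W = rng.binomial(1, p, size=(20000, n)).sum(axis=1)   # 20000 iid copies of W = W_1 + ... + W_n
empirical = np.mean(W >= (1 + delta) * mu)
bound = np.exp(-delta**2 * mu / 3)

print(f"empirical tail = {empirical:.2e}, Chernoff bound = {bound:.2e}")
```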

1.1 An Example: Median Finding

Goal:

  • Input a set S of integers

  • Find the median m ∈ S: at least n/2 elements are ≤ m and at least n/2 + 1 elements are ≥ m.

We want to find d, u ∈ S such that d ≤ m ≤ u, with C = {s ∈ S : d ≤ s ≤ u} small enough. Specifically, we do

  • Count the number of elements in S larger than u and the number smaller than d.

  • Sort C in O ( | C | log | C | ) = o ( n ) time.

  • Deduce the median.

How to choose C? We first select a multiset R of n^{3/4} elements sampled i.i.d. from S, and sort R.

  • Let d be the ((1/2)n^{3/4} − √n)-th smallest element of R and u the ((1/2)n^{3/4} + √n)-th smallest.

  • Compute C, L_d = {s ∈ S : s < d}, and L_u = {s ∈ S : s > u}. If |L_d| ≥ n/2 or |L_u| ≥ n/2, we fail. If |C| > 4n^{3/4}, we fail.

  • Otherwise, we sort C and return the (n/2 − |L_d|)-th smallest element in C.

The runtime is O ( n ) indeed, but we have to bound the probability of failure.
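A sketch of the randomized median-finding procedure described above (helper names and the failure handling are mine; the constants follow the notes). It returns the median with high probability, or None on the rare failure events.

```python
import random

def approx_median(S):
    n = len(S)
    r = int(n ** 0.75)
    R = sorted(random.choice(S) for _ in range(r))       # multiset sample with replacement
    lo = max(0, int(r / 2 - n ** 0.5))                   # rank of d in sorted R
    hi = min(r - 1, int(r / 2 + n ** 0.5))               # rank of u in sorted R
    d, u = R[lo], R[hi]

    C = [s for s in S if d <= s <= u]
    L_d = sum(1 for s in S if s < d)
    L_u = sum(1 for s in S if s > u)

    if L_d >= n / 2 or L_u >= n / 2 or len(C) > 4 * n ** 0.75:
        return None                                      # failure event; rerun in practice
    C.sort()
    return C[n // 2 - L_d]                               # the (n/2 - |L_d|)-th smallest of C

S = list(range(1001))
random.shuffle(S)
print(approx_median(S))                                  # 500 with high probability
```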

Y_1 = 𝟙{ |{r ∈ R : r ≤ m}| < (1/2)n^{3/4} − √n }. If Y_1 = 0, then d ≤ m, so |L_d| < n/2.

Y_2 = 𝟙{ |{r ∈ R : r ≥ m}| < (1/2)n^{3/4} − √n }. If Y_2 = 0, then u ≥ m, so |L_u| < n/2.

Y_3 = 𝟙{ |C| > 4n^{3/4} }.

Pr(Y_1 = 1) ≤ (1/4)n^{−1/4} by applying the Chebyshev bound. (A better bound could be obtained with the Chernoff bound.)

Pr(Y_3 = 1) ≤ (1/2)n^{−1/4}. If Y_3 = 1, then either at least 2n^{3/4} elements of C are larger than m or at least 2n^{3/4} are smaller than m. Suppose the first happens. Then R has at least (1/2)n^{3/4} − √n samples that are among the (1/2)n − 2n^{3/4} largest elements in S. (Consider the rightmost interval in R.) Let R = {a_i}_{i=1}^{n^{3/4}}, and let X_i = 1 if a_i is larger than the ((1/2)n + 2n^{3/4})-th smallest element. Then 𝔼(X_i) = 1/2 − 2n^{−1/4} and 𝔼(∑_i X_i) = (1/2)n^{3/4} − 2√n, so

Pr(∑_{i=1}^{n^{3/4}} X_i ≥ (1/2)n^{3/4} − √n) ≤ (1/2)n^{−1/4}. (5)

By the union bound, we fail with probability less than n^{−1/4}.

2 Flows

Given a graph G = (V, E), compute its maximum s–t flow.

Ford–Fulkerson algorithm: O(mF), where F is the value of the maximum flow (at most C·deg(s), with C the largest edge capacity).

Edmonds–Karp: O(m²n).

Theorem 2.1.

The value of the maximum flow from s to t equals the capacity of the minimum cut separating s and t.

Note: repeatedly augmenting along s–t paths in the residual graph gives the maximum flow. In the end s and t are disconnected in the residual graph, which naturally yields a min-cut.

2.1 Applications

Bipartite matching. Use max-flow algorithms to find a maximum matching in a bipartite graph.

Theorem 2.2 (Hall).

A bipartite graph has a perfect matching iff for every S ⊆ L (equivalently, every S ⊆ R), |Γ(S)| ≥ |S|, where Γ(S) is the set of neighbors of S.

Image Segmentation Let f i be the cost of assigning pixel i to the foreground. b i is the cost of assigning it to the background. And s i j is the cost of separating i and j .

min_S cost(S) = ∑_{i ∈ S} f_i + ∑_{i ∈ S^C} b_i + ∑_{i ∈ S, j ∈ S^C} s_{ij} (1)

This can be translated into an (s, t) min-cut problem.

2.2 Min-Cut

Instead of finding an (s, t) min-cut, we now look for a global min-cut: a partition (S, S^C) of the vertex set V minimizing the number of crossing edges.

Karger’s randomized algorithm

Repeatedly contract a uniformly random edge, merging its two endpoints, until only two supernodes are left. The runtime is O(n²) (O(n) per contraction).
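A minimal sketch of Karger's contraction algorithm, assuming the graph is given as an edge list over vertices 0..n−1. One run returns the size of some cut; repeating O(n² log n) times and taking the minimum succeeds with high probability.

```python
import random

def karger_cut(n, edges):
    parent = list(range(n))
    def find(u):                      # union-find to track supernodes
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    supernodes = n
    while supernodes > 2:
        u, v = random.choice(edges)
        ru, rv = find(u), find(v)
        if ru != rv:                  # contract the edge: merge the two supernodes
            parent[ru] = rv
            supernodes -= 1
    # edges whose endpoints lie in different supernodes form the cut
    return sum(1 for u, v in edges if find(u) != find(v))

edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]   # two triangles joined by one edge
print(min(karger_cut(6, edges) for _ in range(100)))               # min cut is 1
```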

This randomized algorithm returns the correct answer with probability

p ≥ 1/\binom{n}{2} = 2/(n(n − 1)). (2)

Repeating n² log n times and returning the best cut found gives overall runtime O(n⁴ log n) and failure probability O(1/n^c).

Proof sketch.

When there are t supernodes left, the probability that the contracted edge e lies in the min-cut δ(S, S^C) satisfies

Pr[e ∈ δ(S, S^C)] ≤ 2/t. (3)

This is because each supernode must have degree at least k if the minimum cut has size k, so at least tk/2 edges remain while only k of them cross the cut.

Then we multiply the survival probabilities (1 − 2/t) for t = n, n − 1, …, 3, which yields the final bound. ∎

Corollary 2.3.

There are at most \binom{n}{2} min-cuts in a graph.

2.3 Complexity classes of BPP: randomized poly time

BPP is the class of problems solvable by a randomized algorithm that runs in poly(n) time and is correct with probability at least 2/3 on every input x.

Boosting: run an algorithm in BPP n times independently and take a majority vote; then by the Chernoff bound,

Pr[correct] ≥ 1 − exp(−n/48). (4)

3 Streaming Algorithms

Inputs arrive continuously and we cannot store all the data, so we need an online algorithm that processes each item as it arrives using small space.

Setup.

Inputs: σ = (s_1, s_2, …, s_m), where s_j ∈ {1, …, n}.

Assumption: the stream length m and the range n are known.

Let f_i = |{1 ≤ j ≤ m : s_j = i}|.

F_0 = ∑_{i=1}^n (f_i)^0 = number of distinct elements, where 0^0 = 0.

F_1 = ∑_i f_i = m; space complexity O(log m).

What about F_2 = ∑_i f_i²? Can we find an algorithm with logarithmic space complexity?

3.1 Deterministic algorithm fails for F 2

Assume there exists a deterministic algorithm that computes F_2 exactly using S bits of memory. We construct a stream of pairs:

((1, x_1), …, (n, x_n)), where x ∈ {0, 1}^n. (5)

Run the algorithm and let the memory at the end be m(x) ∈ {0, 1}^S.

We then initialize the memory to m(x) and feed the token (i, 0) to the algorithm. There are two possibilities:

  • F_2 increases by 1 if x_i = 1,

  • F_2 increases by 2² − 1 = 3 if x_i = 0.

It is possible to completely recover x from m(x), so m(x) must take at least 2^n possible values. The space complexity is therefore Ω(n).

3.2 Randomized algorithm for F 2

Hash function h : {1, …, n} → {−1, +1}.

  • Initialize: set c = 0

  • Process s j : Add h ( s j ) to c .

  • Return c 2 .

Space: −m ≤ c ≤ m, so it takes O(log m) space.
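A minimal sketch of the F_2 estimator above. For simplicity the hash h is a fully random sign per item stored in a dict; as the notes explain below, 4-wise independence suffices, which is what a real implementation would use to keep the space logarithmic.

```python
import random
from collections import Counter

def f2_estimate(stream):
    sign = {}                                   # h : item -> {-1, +1}
    c = 0
    for s in stream:
        if s not in sign:
            sign[s] = random.choice((-1, 1))
        c += sign[s]                            # process s_j: add h(s_j) to c
    return c * c                                # unbiased estimator of F_2

stream = [random.randint(1, 50) for _ in range(10_000)]
truth = sum(f * f for f in Counter(stream).values())
# averaging independent copies reduces variance; median-of-means gives the stated guarantee
est = sum(f2_estimate(stream) for _ in range(50)) / 50
print(truth, round(est))
```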

We show that Z = c² is an unbiased estimator of F_2. For j ∈ {1, …, n}, let Y_j = h(j) ∈ {±1}.

Z = (f_1 Y_1 + ⋯ + f_n Y_n)², (6)

so

𝔼[Z] = ∑_{i=1}^n f_i² 𝔼[Y_i²] + ∑_{i≠j} f_i f_j 𝔼(Y_i) 𝔼(Y_j) = ∑_{i=1}^n f_i² = F_2. (7)

We now bound the variance of Z to apply tail bounds.

𝔼[Z²] = ∑_{i,j,k,l} f_i f_j f_k f_l 𝔼[Y_i Y_j Y_k Y_l] (8)
= ∑_{i=1}^n f_i⁴ 𝔼[Y_i⁴] + 3 ∑_{i≠j} f_i² f_j² 𝔼[Y_i²] 𝔼[Y_j²] (9)
= F_4 + 3(F_2² − F_4), (10)

so Var(Z) = 2F_2² − 2F_4 ≤ 2F_2².

Pr[|Z − F_2| ≥ ϵF_2] ≤ 2F_2²/(ϵ²F_2²) = 2/ϵ². (11)

This bound on its own is not satisfactory. Using the median-of-means trick, we get an exponential boost: a (1 ± ϵ)-approximation w.p. at least 1 − δ using space O((1/ϵ²) log(1/δ) log m).

3.3 Derandomization

Definition 3.1.

A family of random variables (x_1, …, x_N) is called k-wise independent if for every k-tuple of distinct indices (i_1, …, i_k) ∈ {1, …, N}^k, the variables (x_{i_1}, …, x_{i_k}) are independent.

Specifically, when they are N -wise independent, it is called fully independent.

Remark. The randomized algorithm above does not need full independence: 4-wise independence suffices (it is all the variance analysis uses). This is also why we can store such a hash function in O(log n) space rather than O(n).

Definition 3.2.

A family H of hash functions h : A → B is called k-wise independent if for any distinct points x_1, …, x_k ∈ A and any i_1, …, i_k ∈ B,

Pr_{h ∈ H}(h(x_1) = i_1, …, h(x_k) = i_k) = 1/|B|^k. (12)

3.4 Construction of k -wise independent hash functions

Let A = B = ℤ_p := {0, 1, …, p − 1}, where p is a prime.

H_2 = {f_{a,b} : x ↦ ax + b mod p, (a, b) ∈ ℤ_p²}. (13)

Then

Pr_{a,b}(ax_1 + b = i_1 ∧ ax_2 + b = i_2) = Pr_{a,b}(a = (i_2 − i_1)/(x_2 − x_1) ∧ b = i_2 − x_2 a) = 1/p². (14)

So this is a 2-wise (pairwise) independent hash family. We get h(0), …, h(p − 1), i.e., p pairwise-independent random variables, using only 2 log p bits of space.

We can extend this to x ↦ ∑_{i=0}^{k−1} a_i x^i mod p to obtain a k-wise independent hash function family, with space complexity k log p.
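A sketch of the degree-(k−1) polynomial construction above: a k-wise independent hash family over ℤ_p, stored with the k coefficients (k log p bits). The derived ±1 hash at the end is a common practical shortcut, not part of the notes, and is unbiased only up to an O(1/p) error.

```python
import random

P = 2_147_483_647             # a prime (2^31 - 1)

def make_kwise_hash(k, p=P):
    coeffs = [random.randrange(p) for _ in range(k)]      # a_0, ..., a_{k-1}
    def h(x):
        # Horner evaluation of a_0 + a_1 x + ... + a_{k-1} x^{k-1} mod p
        acc = 0
        for a in reversed(coeffs):
            acc = (acc * x + a) % p
        return acc
    return h

h = make_kwise_hash(k=4)      # 4-wise independence suffices for the F_2 sketch
sign = lambda x: 1 if h(x) % 2 == 0 else -1               # {-1,+1} hash derived from h (tiny bias)
print(h(12345), sign(12345))
```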

3.5 Algorithm for F 0 .

Hash function h : {1, …, n} → {1, …, n}.

  • Initialize c = 0

  • Process s_j: let z be the largest r such that 2^r divides h(s_j) (the number of trailing zeros). If z > c, set c ← z.

  • Return 2^{c + 1/2}.

Intuition:

If the stream has d distinct elements, there is a good chance that one of the d values h(s_j) is divisible by a power of 2 close to d.

If there are no more than d distinct elements, it is unlikely that any h(s_j) has many more than log d trailing zeros.
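A sketch of the F_0 estimator above, using a simple pairwise-independent-style hash (the modulus 1,000,003 is an arbitrary prime chosen for illustration): track the largest number of trailing zeros c seen among the h(s_j) and return 2^{c + 1/2}.

```python
import random
from math import sqrt

def f0_estimate(stream, m=1_000_003):
    a, b = random.randrange(1, m), random.randrange(m)
    h = lambda x: (a * x + b) % m + 1                      # maps into {1, ..., m}
    c = 0
    for s in stream:
        hv = h(s)
        z = (hv & -hv).bit_length() - 1                    # number of trailing zeros of h(s)
        c = max(c, z)
    return 2 ** c * sqrt(2)                                # 2^{c + 1/2}

stream = [random.randint(1, 10_000) for _ in range(5_000)]
print(len(set(stream)), round(f0_estimate(stream)))        # truth vs. (rough) estimate
```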

For j ∈ {1, …, n} and r ≥ 0, let X_{r,j} be the indicator random variable that h(j) is divisible by 2^r.

Y_r = ∑_{j : f_j > 0} X_{r,j}. So if Y_r = 0, no element hashes to a value with r or more trailing zeros, hence c ≤ r − 1.

𝔼[X_{r,j}] = Pr(h(j) is divisible by 2^r) = 1/2^r, (15)

so

𝔼[Y_r] = F_0/2^r, Var(Y_r) = ∑_{j : f_j > 0} Var(X_{r,j}) ≤ F_0/2^r. (16)
Pr[Y_r > 0] ≤ F_0/2^r, Pr[Y_r = 0] ≤ 2^r/F_0. (17)

4 Johnson-Lindenstrauss Lemma

JL Lemma states that we can map data that lie in a high-dimensional space to a low-dimensional one, preserving some of the structures in data.

Lemma 1 (JL Lemma).

Let S be a set of n points in ℝ^d. For some dimension k = O(ln n / ϵ²), there exists a matrix A ∈ ℝ^{k×d} such that for all u, v ∈ S,

(1 − ϵ)‖u − v‖² ≤ ‖Au − Av‖² ≤ (1 + ϵ)‖u − v‖². (18)

4.1 Norm Preservation

Lemma 2.

For any integer d > 0, any 0 < ϵ, δ < 1, and any integer k > 4 log(2/δ)/(ϵ² − ϵ³), there exists a distribution on k × d real matrices such that for any x ∈ ℝ^d,

Pr((1 − ϵ)‖x‖² ≤ ‖Ax‖² ≤ (1 + ϵ)‖x‖²) > 1 − δ. (19)
Proof of the JL Lemma from Lemma 2.

Let δ = 1/n³. Apply Lemma 2 to each difference u − v and take a union bound over the at most n² pairs u, v ∈ S. ∎

In fact, i.i.d. Gaussian entries A_{ij} ∼ 𝒩(0, σ²) (with σ² = 1/k) give the construction we need for this lemma.
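A minimal sketch of the Gaussian JL projection: entries A_{ij} ∼ 𝒩(0, 1/k) with k = O(log n / ϵ²). The constant 4 and the data are illustrative, not from the notes.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n, d, eps = 200, 1000, 0.25
k = int(4 * np.log(n) / eps**2)

X = rng.normal(size=(n, d))                          # n points in R^d
A = rng.normal(scale=1 / np.sqrt(k), size=(k, d))    # A_ij ~ N(0, 1/k)
Y = X @ A.T                                          # projected points in R^k

ratios = pdist(Y, 'sqeuclidean') / pdist(X, 'sqeuclidean')
print(k, ratios.min(), ratios.max())                 # should lie roughly within [1-eps, 1+eps]
```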

Proof of lemma 2.

Choose the entries of A ∈ ℝ^{k×d} at random: A_{ij} ∼ 𝒩(0, 1/k).

(Ax)_1 = ∑_{i=1}^d A_{1,i} x_i (20)

This is a 𝒩(0, ‖x‖²/k) random variable.

𝔼[‖Ax‖²] = k·𝔼[(Ax)_1²] = k·‖x‖²/k = ‖x‖². (21)

Now let Z_i ∼ 𝒩(0, ‖x‖²/k) be i.i.d.; then

Pr(‖Ax‖² ≥ (1 + ϵ)‖x‖²) = Pr(∑_{i=1}^k Z_i² ≥ (1 + ϵ)‖x‖²) (22)
= Pr(∑_{i=1}^k Y_i² ≥ (1 + ϵ)k), where Y_i ∼ 𝒩(0, 1) (23)
= Pr(exp(t ∑_{i=1}^k Y_i²) ≥ e^{t(1 + ϵ)k}) (24)
≤ ∏_{i=1}^k 𝔼(e^{t Y_i²}) · e^{−tk(1 + ϵ)} (25)
= (1 − 2t)^{−k/2} e^{−tk(1 + ϵ)}. (26)

The last equality holds because 𝔼(e^{tX²}) = 1/√(1 − 2t) for X ∼ 𝒩(0, 1) and t < 1/2. Then, applying the fact that 1 + x ≤ e^x ≤ 1 + x + x²/2 and choosing t = ϵ/(2(1 + ϵ)) with k > 4 log(2/δ)/(ϵ² − ϵ³), we get

Pr(‖Ax‖² ≥ (1 + ϵ)‖x‖²) ≤ δ/2. (27)

The lower tail is bounded by δ/2 symmetrically, which completes the proof. ∎

4.2 Packing

In ℝ^k we can have only k perfectly orthogonal vectors. Let e_1, …, e_n ∈ ℝ^n be orthonormal. After applying the JL lemma we obtain e_1', …, e_n' ∈ ℝ^k with |⟨e_i', e_j'⟩| ≤ ϵ for i ≠ j.

This means that for k ≥ 1 and 0 < ϵ < 1, there exist n = e^{Ω(ϵ²k)} near-orthogonal vectors in ℝ^k.

Matching upper bound. Let e_i', e_j' be two ϵ-near-orthogonal unit vectors in ℝ^d; then the two balls centered at e_i' and e_j' with radius (1/2)√(2 − 2ϵ) are disjoint, while all such balls are contained in the ball of radius 1 + (1/2)√(2 − 2ϵ). Comparing volumes bounds the number of such vectors.

With a more fine-grained analysis, one gets the tighter bound e^{O(ϵ²d)}, matching the construction above.

4.3 Application: Approximate Nearest Neighbor

Given points x_1, …, x_n ∈ ℝ^d and a query y ∈ ℝ^d, find the closest x_i to y.

If d is small, we have efficient data structures with quasi-linear space and O(log n) query time.

If d is large, we either use O(nd) space with O(nd) query time, or n^{O(d)} space with O(d log n) query time.

Definition 4.1 ( ϵ -Approximate NN).

Given y, an ϵ-approximate nearest neighbor is an output i such that ‖y − x_i‖ ≤ (1 + ϵ) min_j ‖y − x_j‖.

Assume the closest point to y has distance 1.

  • Preprocess:

    Fix a grid on ℝ^d with side length ϵ/√d. Let G_i be the set of grid cells that contain at least one point at distance at most 1 from x_i.

    Store ( c , i ) for each c G i .

  • The space consumption for each i :

    (2^d/d^{d/2}) / (ϵ/√d)^d = (2/ϵ)^d = ϵ^{−O(d)}. (28)
  • The query time is O ( d ) .

How to do better using JL lemma?

In the preprocessing phase, project the x_i to a space of dimension d' = O(log n / ϵ²).

The space cost is then

n^{O(log(1/ϵ)/ϵ²)} (29)

The query time is dominated by projecting y into the low-dimensional space, which takes

O(d ϵ^{−2} log n) (30)

time.

5 Linear Programming

Standard form

max_x c^T x
s.t. Ax ≤ b, x ≥ 0. (31)

Dual problem

min_y b^T y (32)
s.t. A^T y ≥ c, y ≥ 0. (33)
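A small numerical illustration of the primal–dual pair above using scipy (which minimizes, so the primal objective is negated). The data c, A, b are arbitrary example values, not from the notes.

```python
import numpy as np
from scipy.optimize import linprog

c = np.array([3.0, 2.0])
A = np.array([[1.0, 1.0], [2.0, 1.0]])
b = np.array([4.0, 6.0])

primal = linprog(-c, A_ub=A, b_ub=b, bounds=[(0, None)] * 2)     # max c^T x, Ax <= b, x >= 0
dual = linprog(b, A_ub=-A.T, b_ub=-c, bounds=[(0, None)] * 2)    # min b^T y, A^T y >= c, y >= 0

print(-primal.fun, dual.fun)   # equal by strong LP duality (both 10.0 here)
```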

5.1 Uncapacitated Facility Location

Let D be the set of clients and F a set of facilities. There is a cost f_i to open facility i.

Goal: choose a subset F' ⊆ F that minimizes ∑_{i ∈ F'} f_i + ∑_{j ∈ D} min_{i ∈ F'} c_{ij}.

The connection costs c_{ij} should satisfy the triangle inequality.

We find an equivalent integer programming:

min_{x,y} ∑_{i ∈ F} f_i y_i + ∑_{i ∈ F, j ∈ D} c_{ij} x_{ij}
s.t. ∑_{i ∈ F} x_{ij} = 1, ∀j ∈ D,
x_{ij} ≤ y_i, ∀i, j,
x_{ij}, y_i ∈ {0, 1}. (34)

This problem is in general NP-hard. Hence we relax the integer constraint to x_{ij}, y_i ∈ [0, 1].

How to round a solution of the linear program to an integer solution?

Fix an order j_1, …, j_k for the clients. Let N(j) = {i : x_{ij} > 0}. We say j and j' are close if N(j) ∩ N(j') ≠ ∅.

Dual problem:

max_v ∑_{j ∈ D} v_j (35)
s.t. v_j ≤ c_{ij} + w_{ij}, ∀i ∈ F, j ∈ D, (36)
∑_{j ∈ D} w_{ij} ≤ f_i, ∀i ∈ F, (37)
w_{ij} ≥ 0. (38)

Greedy deterministic LP rounding

  • Solve the primal and dual LPs to obtain x*_{ij} and v*_j.

  • Order the set D of users according to increasing v*_j.

    While D is nonempty:

  • Let j be the remaining user with smallest v*_j, and let i ∈ N(j) minimize f_i. Open facility i, assign j to i, and remove j from D.

  • For every j' ∈ D with N(j') ∩ N(j) ≠ ∅, assign j' to i and remove j' from D.

Analysis.

Noting that ∑_{i ∈ N(j_k)} x*_{i j_k} = 1, we have

∑_k f_{i_k} = ∑_k f_{i_k} ∑_{i ∈ N(j_k)} x*_{i j_k} (39)
≤ ∑_k ∑_{i ∈ N(j_k)} f_i x*_{i j_k}   (f_{i_k} ≤ f_i, ∀i ∈ N(j_k))
≤ ∑_k ∑_{i ∈ N(j_k)} f_i y*_i   (LP constraint x*_{ij} ≤ y*_i)
≤ ∑_i f_i y*_i. (40)

Suppose a user l is assigned to facility i_k (opened for cluster center j_k). Let h ∈ N(l) ∩ N(j_k).

c_{i_k l} ≤ c_{i_k j_k} + c_{h j_k} + c_{h l} (41)
≤ v*_{j_k} + v*_{j_k} + v*_l (42)
≤ 3 v*_l. (43)

This is because complementary slackness implies

if x*_{ij} > 0, then v*_j = c_{ij} + w_{ij} ≥ c_{ij}. (44)

Therefore,

∑_j c_{g(j), j} ≤ 3 ∑_j v*_j ≤ 3 Z*_{LP}, (45)
cost = facility cost + connection cost ≤ ∑_i f_i y*_i + 3 Z*_{LP} ≤ 4 Z*_{LP} ≤ 4 Z*_{IP}, (46)

where g(j) denotes the facility serving client j.

6 Semi-definite programming

We consider the Max-Cut problem on a graph G = (V, E): choose a subset U ⊆ V to maximize |E(U, V∖U)|, i.e., the number of edges connecting a node in U to a node outside U.

Integer program for Max-Cut

max ∑_{(u,v) ∈ E} z_{uv}
s.t. z_{uv} ≤ x_u + x_v,
z_{uv} ≤ (1 − x_u) + (1 − x_v),
z_{uv}, x_u ∈ {0, 1}. (47)

Randomly cutting the graph gives us a 1 / 2 approximation.

max ∑_{(u,v) ∈ E} (1 − x_u x_v)/2 (48)
s.t. x_i ∈ {−1, 1}, ∀i ∈ V. (49)

6.1 SDP definition

Definition 6.1 (Positive semi-definite matrix).

Let X n × n be symmetric. X is positive semi-definite or X 0 , if the following equivalent statements are true:

  • ∀a ∈ ℝ^n, a^T X a ≥ 0,

  • X = B T B for some B .

  • All eigenvalues of X are nonnegative.

Definition 6.2 (SDP standard form).
min_{X ∈ ℝ^{n×n}} tr(C^T X)
s.t. tr(A_i^T X) = b_i, X ⪰ 0. (50)

Dual problem

max_y b^T y (51)
s.t. ∑_{i=1}^m y_i A_i + S = C, (52)
S ⪰ 0. (53)

To derive this, remember that we want

C , X i y i A i , X = b T y . (54)

so C i y i A i , X 0 for all X 0 , which means C i y i A i 0 .

6.2 Strong duality

Theorem 6.3 (Slater’s condition).

Strong duality holds for SDP problems if the feasible region has an interior point.

6.3 Max Cut

Now we relax the max cut problem to a SDP problem.

max ∑_{(u,v) ∈ E} (1 − ⟨x_u, x_v⟩)/2 (55)
s.t. x_i ∈ 𝕊^{n−1}. (56)

Then let A be the adjacency matrix and X the Gram matrix with X_{ij} = ⟨x_i, x_j⟩. Then this is an SDP.

Theorem 6.4 (This is an SDP!).

Let A be the adjacency matrix, A_{ij} = 1 if (i, j) ∈ E and A_{ij} = 0 otherwise. Define X to be the Gram matrix X_{ij} = x_i^T x_j. Then Max-Cut is equivalent to the following SDP:

min ⟨X, A⟩ = tr(A^T X)
s.t. X ⪰ 0, X symmetric,
X_{ii} = 1, ∀i. (57)
Theorem 6.5 (Goemans-Williamson).

α_GW = min_{0 ≤ θ ≤ π} (2/π)·θ/(1 − cos θ) ≈ 0.87856.

Solve the SDP to obtain X with X_{uv} = ⟨x_u, x_v⟩. Recover the vectors x_u by decomposing X = U^T U. Choose a vector a uniformly at random on the sphere 𝕊^{n−1}.

Set x̂_u = sign(⟨a, x_u⟩).
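A sketch of Goemans–Williamson rounding: solve the Max-Cut SDP with cvxpy (assumed available; its bundled SCS solver handles the PSD constraint), factor X, then round with a random hyperplane. The 5-cycle is just an example graph.

```python
import numpy as np
import cvxpy as cp

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]          # 5-cycle, max cut = 4
n = 5

X = cp.Variable((n, n), PSD=True)
objective = cp.Maximize(sum((1 - X[u, v]) / 2 for u, v in edges))
prob = cp.Problem(objective, [cp.diag(X) == 1])
prob.solve()

# factor X = B B^T (eigendecomposition); row u of B is the unit vector x_u
w, V = np.linalg.eigh(X.value)
B = V * np.sqrt(np.clip(w, 0, None))

a = np.random.default_rng(0).normal(size=n)               # random hyperplane normal
signs = np.sign(B @ a)                                    # x_hat_u = sign(<a, x_u>)
cut = sum(1 for u, v in edges if signs[u] != signs[v])
print(prob.value, cut)                                    # E[cut] >= 0.878 * SDP value
```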

6.4 Quadratic programs

OPT = max_{x_i, y_j ∈ {±1}} ∑_{ij} A_{ij} x_i y_j. (58)

This problem is hard; in particular, it is at least as hard as Max-Cut, i.e., max_{x_i ∈ {±1}} ∑_{(i,j) ∈ E} (1 − x_i x_j)/2. To see this, make the diagonal entries A_{ii} extremely large, so that x_i = y_i must hold in any optimal solution; Max-Cut then reduces to this problem.

Relaxation

OPT' = max_{‖u_i‖ = ‖v_j‖ = 1} ∑_{ij} A_{ij} ⟨u_i, v_j⟩ = ∑_{ij} A_{ij} z_{ij}. (59)

Let B = [[C, Z], [Z^T, D]] ⪰ 0, where the diagonal blocks C and D have unit diagonal entries. Indeed, B = W W^T for W whose rows are the u_i and v_j, so B ⪰ 0.

This is an SDP program.

Theorem 6.6 (Grothendieck’s inequality).
OPT ≤ OPT' ≤ (π / (2 ln(1 + √2))) OPT. (60)

We first try the hyperplane rounding used in Goemans–Williamson.

The SDP contributes: ∑ A_{ij} ⟨u_i, v_j⟩ = ∑ A_{ij} cos θ_{ij}.

The rounding contributes: ∑ A_{ij} (1·(1 − θ_{ij}/π) − 1·(θ_{ij}/π)) = ∑ A_{ij} (1 − 2θ_{ij}/π).

However, the A_{ij} could be negative, so comparing these two quantities term by term does not give an approximation guarantee.

Fortunately, we have the following lemma, which directly proves the theorem: first apply the transformation in the lemma, then do the hyperplane rounding.

Lemma 3.

For any unit vectors u_1, …, u_m, v_1, …, v_n, there exists a new set of unit vectors u_1', …, u_m', v_1', …, v_n' such that

𝔼_a[sign(a^T u_i') sign(a^T v_j')] = (2/π) ln(1 + √2) ⟨u_i, v_j⟩. (61)
Proof.

Define the tensor power u^{⊗2} = (u_1², u_1 u_2, …, u_n²), the vector of all pairwise products of coordinates. Let c = sinh^{−1}(1) = ln(1 + √2).

By Taylor’s expansion,

sin(c⟨u, v⟩) = ∑_{k=0}^∞ (−1)^k c^{2k+1}/(2k+1)! · ⟨u, v⟩^{2k+1}. (62)

Let

u' = (√(c/1!)·u, −√(c³/3!)·u^{⊗3}, …, (−1)^k √(c^{2k+1}/(2k+1)!)·u^{⊗(2k+1)}, …) (63)

and

v' = (√(c/1!)·v, √(c³/3!)·v^{⊗3}, …, √(c^{2k+1}/(2k+1)!)·v^{⊗(2k+1)}, …). (64)

Then ⟨u', v'⟩ = sin(c⟨u, v⟩).

And ⟨u', u'⟩ = sinh(c) = 1: removing the alternating signs in the Taylor expansion yields sinh(c⟨u, u⟩) = sinh(c).

Then

𝔼_a[sign(a^T u') sign(a^T v')] = (2/π) arcsin⟨u', v'⟩ = (2/π)·c·⟨u, v⟩. ∎ (65)

Lovász theta function

Let α ( G ) be the size of the largest independent set of G .

Let χ ( G ) be the chromatic number of G . And let G ¯ be the complement of G .

Then

α(G) ≤ χ(Ḡ). (66)

We define

θ(G) := min k, (67)
s.t. ⟨v_i, v_j⟩ = −1/(k − 1) for all non-adjacent pairs (i, j) ∉ E, (68)
⟨v_i, v_i⟩ = 1. (69)
Theorem 6.7.

α(G) ≤ θ(G) ≤ χ(Ḡ).

Proof.

First, θ(G) ≤ χ(Ḡ).

Fact: there exist k unit vectors u_1, …, u_k with ⟨u_i, u_j⟩ = −1/(k − 1) for i ≠ j (the vertices of a regular simplex).

Then any k-coloring of Ḡ yields a feasible solution: map vertex i to u_{c(i)}, where c(i) is its color.

The other direction: θ(G) ≥ α(G). Solve the SDP to obtain the v_i. For an independent set of size s, consider its vectors v_1, …, v_s:

0 ≤ (∑_{i=1}^s v_i)^T (∑_{i=1}^s v_i) = ∑_{i=1}^s v_i^T v_i + ∑_{i≠j} v_i^T v_j. (70)

Hence at least one cross term satisfies v_i^T v_j ≥ −s/(s(s − 1)) = −1/(s − 1). But these vertices are pairwise non-adjacent, so v_i^T v_j = −1/(θ(G) − 1), which forces θ(G) ≥ s. ∎

7 Online Algorithms

7.1 Multiplicative Weights

We wish to make predictions over T days, relying on n experts. Define f_i^t to be the loss of expert i on day t. Without loss of generality, ‖f^t‖_∞ ≤ 1.

Our goal is to minimize the regret

Regret = ∑_{t=1}^T ⟨p^t, f^t⟩ − min_i ∑_{t=1}^T f_i^t. (71)

Naive idea: always follow the expert that has been best so far. Then there exists a sequence of losses f^t for which the regret is at least T(1 − 1/n).

Weighted majority. Assume we are making binary predictions. Assign each expert i the weight w_i(1) = 1. We predict with a weighted majority vote:

x̂ = ∑_{i=1}^n w_i x_i / ∑_{i=1}^n w_i, rounded to the majority side. (72)

Assumption: There is an expert who is always correct.

Algorithm: remove every expert that makes a wrong prediction. Each time we make a mistake, at least half of the remaining experts are removed, so we make at most O(log n) mistakes. This is basically the idea behind the multiplicative weights algorithm.

If we remove the assumption that some expert is always correct, we can restart once all experts have been removed. The bound becomes O((M + 1) log n) if the best expert makes M mistakes.

The idea of multiplicative weight on binary outcomes

  • Set w i ( 1 ) = 1 .

  • After observing the outcome on day t, set w_i(t + 1) = w_i(t)/2 if expert i made a wrong prediction.

Analysis: let Φ(t) = ∑_{i=1}^n w_i(t). Then, for any expert i,

Φ(T + 1) ≥ w_i(T + 1) = (1/2)^{#{i's mistakes}}, (73)
Φ(t + 1) ≤ (3/4)Φ(t) whenever the algorithm makes a mistake. (74)

Therefore,

#{algorithm's mistakes} ≤ (1/log₂(4/3)) · (#{i's mistakes} + log₂ n). (75)

However, this does not give vanishing regret: the constant 1/log₂(4/3) ≈ 2.4 multiplying the best expert's mistakes means the regret can still be linear in T. We now state the full version of multiplicative weights.

Multiplicative weights algorithm

  • Set w i ( 1 ) = 1 .

  • For t = 1 , , T ,

    • Follow expert i with probability p_i^t = w_i^t / ∑_i w_i^t.

    • Set w_i(t + 1) ← w_i(t)(1 − ϵ f_i^t), ∀i.
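A minimal sketch of the multiplicative weights algorithm above on synthetic losses in [0, 1] (the loss data and the "better expert" are illustrative). With ϵ = √(log n / T), the regret should be O(√(T log n)).

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 10, 5000
eps = np.sqrt(np.log(n) / T)
losses = rng.uniform(size=(T, n))
losses[:, 3] *= 0.8                      # make expert 3 slightly better on average

w = np.ones(n)
alg_loss = 0.0
for t in range(T):
    p = w / w.sum()                      # follow expert i with probability p_i^t
    alg_loss += p @ losses[t]            # expected loss of the algorithm on day t
    w *= 1 - eps * losses[t]             # multiplicative update w_i <- w_i (1 - eps f_i^t)

best = losses.sum(axis=0).min()
print(alg_loss - best, 2 * np.sqrt(T * np.log(n)))   # realized regret vs. 2*sqrt(T log n)
```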

Theorem 7.1.

If 0 < ϵ ≤ 1/2, then multiplicative weights gives the following bound on the regret:

Regret ≤ log n/ϵ + ϵT. (76)

If T is known, setting ϵ = √(log n/T) gives Regret ≤ 2√(T log n).

Proof.

Define a potential function Φ(t) = ∑_i w_i(t). Then

Φ(t + 1) = ∑_i w_i(t)(1 − ϵ f_i^t) = Φ(t) ∑_i p_i^t (1 − ϵ f_i^t) = Φ(t)(1 − ϵ⟨p^t, f^t⟩). (77)

Therefore,

Φ(t + 1) = Φ(t)(1 − ϵ ℓ_t) ≤ Φ(t) exp(−ϵ ℓ_t), where ℓ_t = ⟨p^t, f^t⟩ is the algorithm's expected loss on day t. (78)

Hence Φ(T + 1) ≤ n exp(−ϵ L_{1:T}), where L_{1:T} = ∑_{t=1}^T ℓ_t.

On the other hand, for any expert i,

Φ(T + 1) ≥ w_i(T + 1) = ∏_{t=1}^T (1 − ϵ f_i^t) ≥ ∏_{t=1}^T exp(−ϵ f_i^t − ϵ²(f_i^t)²). (79)

Then, taking i to be the optimal expert,

L_{1:T} − OPT ≤ (1/ϵ)(log n + ϵ² ∑_{t=1}^T (f_i^t)²) ≤ log n/ϵ + ϵT. (80)

Remark. When ‖f^t‖_∞ ≤ ρ, the regret bound becomes log n/ϵ + ϵρ²T.

7.2 Application of Multiplicative Weight

Zero-sum Games.

v = min x max y c ( x , y ) (81)
Theorem 7.2 (Von Neumann).

Let x , y be mixed strategies of players A and B , then

min x max y c ( x , y ) = max y min x c ( x , y ) . (82)

Another way of phrasing it: there exist mixed strategies (x, y) with x, y ≥ 0 and ∑_i x_i = ∑_i y_i = 1, and a value v, such that

x^T M ≤ v·𝟙^T and M y ≥ v·𝟙, where M_{ij} = c(i, j). (83)

We try to prove this theorem by multiplicative weights. Let ( p 1 , q 1 ) , , ( p T , q T ) be the strategies over T days.

Then

max_i ∑_{t=1}^T (A q^t)_i − ∑_{t=1}^T ⟨p^t, A q^t⟩ ≤ ln m/ϵ + ϵT (84)

and

∑_{t=1}^T ⟨q^t, A^T p^t⟩ − min_j ∑_{t=1}^T (A^T p^t)_j ≤ ln m/ϵ + ϵT (85)

Adding them together,

max_i ∑_{t=1}^T (A q^t)_i − min_j ∑_{t=1}^T (p^{tT} A)_j ≤ 2 ln m/ϵ + 2ϵT = O(√(T log m)) for ϵ = √(ln m/T). (86)

Let p̄ = (1/T)∑_{t=1}^T p^t and q̄ = (1/T)∑_{t=1}^T q^t. Dividing by T,

max_i (A q̄)_i − min_j (p̄^T A)_j ≤ δ := O(√(log m/T)). (87)

This means

min_j (p̄^T A)_j ≤ p̄^T A q̄ ≤ max_i (A q̄)_i ≤ min_j (p̄^T A)_j + δ. (88)

As T → ∞, δ → 0 and the two sides coincide, which proves the theorem.

7.3 Application: Max Flow

Consider an unweighted graph G = (V, E). The max-flow problem can be described by a linear program.

Let P s , t be the set of all paths from s to t , then

max ∑_{p ∈ P_{s,t}} x(p) (89)
s.t. ∑_{p : e ∈ p} x(p) ≤ 1, ∀e, (90)
x(p) ≥ 0. (91)

The dual problem:

min ∑_e ℓ(e) (92)
s.t. ∑_{e ∈ p} ℓ(e) ≥ 1, ∀p ∈ P_{s,t}, (93)
ℓ(e) ≥ 0. (94)

Zero sum game conversion:

Let γ be the optimal flow value. Player P chooses a path and player D chooses an edge (possibly mixed strategies). The payoff for D is 1 if e ∈ P and 0 otherwise.

Lemma 4.

Let v be the value of the game. Then v = 1/γ.

Proof.

Given an optimal dual solution ℓ(e), D plays edge e with probability ℓ(e)/∑_e ℓ(e) = ℓ(e)/γ (by LP duality, ∑_e ℓ(e) = γ). Then for every path P, the payoff for D is

∑_{e ∈ P} ℓ(e)/γ = (1/γ) ∑_{e ∈ P} ℓ(e) ≥ 1/γ. (95)

Let x(p) be an optimal solution to the primal problem. P chooses a path p with probability x(p)/γ. For any edge e,

Pr[e ∈ p] = ∑_{p : e ∈ p} x(p)/γ = (1/γ) ∑_{p : e ∈ p} x(p) ≤ 1/γ. ∎ (96)

Remark. We now want to run a multiplicative weights algorithm on this zero-sum game, but the primal problem has exponentially many variables. We have to make some modifications to fix this issue.

  • For each t = 1, …, T, use multiplicative weights to choose a distribution w^t on edges. Let P_t be the best response to w^t, which amounts to finding a shortest s–t path under the edge weights w^t.

  • Set the reward vector to be: r t ( e ) = 𝟙 [ e P t ] .

  • Suppose the chosen paths are p_1, …, p_T; we route γ/T units of flow on each p_i. Denote the resulting flow by f.
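A sketch of the MW-based approximate max-flow routine above, assuming the optimal flow value γ is known (in the toy unit-capacity graph below, γ = 2); the gain-form weight update and the example graph are my choices. networkx is used only for the shortest-path best response.

```python
import numpy as np
import networkx as nx

edges = [("s", "a"), ("a", "t"), ("s", "b"), ("b", "t"), ("a", "b")]
gamma, eps = 2, 0.1
T = int(4 * gamma**2 * np.log(len(edges)) / eps**2)

w = {e: 1.0 for e in edges}                        # MW weights on edges (player D)
load = {e: 0.0 for e in edges}
for _ in range(T):
    G = nx.DiGraph()
    total = sum(w.values())
    for e in edges:
        G.add_edge(*e, weight=w[e] / total)
    path = nx.shortest_path(G, "s", "t", weight="weight")   # best response of player P
    path_edges = set(zip(path, path[1:]))
    for e in edges:
        r = 1.0 if e in path_edges else 0.0        # reward 1 if e is on the chosen path
        w[e] *= (1 + eps * r)                      # gain version of the MW update
        load[e] += (gamma / T) * r                 # route gamma/T units on the path

print(max(load.values()))                          # congestion stays around 1 + delta
```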

Lemma 5.

This algorithm routes at most 1 + δ units of flow on each edge, for T = 4γ² ln m/δ².

Proof.

Suppose for contradiction that f routes more than 1 + δ units on some edge e. Then more than (1 + δ)T/γ of the paths p_1, …, p_T use edge e.

If player D plays edge e in hindsight, its average payoff is more than (1 + δ)/γ.

In each step, player D gets at most 1/γ in expectation, because P_t is the best response.

Hence δ/γ < Regret/T ≤ 2√(log m/T) = δ/γ by the choice of T, a contradiction. ∎

7.4 Application: Adaboost

Goal: learn an unknown function c : X → {0, 1}, given a sequence of training samples (x, c(x)) with x ∼ D. We want to minimize

𝔼_{x ∼ D}[|h(x) − c(x)|]. (97)

Weak learner: does slightly better than random guessing, with loss at most 1/2 − γ.

We consider samples in the training set as experts.

For a hypothesis h, the penalty for expert x is 0 if h(x) ≠ c(x) and 1 otherwise. (Intuition: we want to put more weight on hard samples.)

In each round, the algorithm produces a distribution D_t over experts and obtains a hypothesis h_t that is a weak learner for D_t, i.e., the penalty satisfies M(D_t, h_t) ≥ 1/2 + γ.

The final hypothesis h_final labels x according to a majority vote over h_1(x), …, h_T(x).
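A toy version of the boosting loop above: experts are training points, the weak learner is a decision stump fitted under the current distribution, and the final classifier is a majority vote of the h_t. The synthetic data, the stump learner, and the value of ϵ are illustrative choices, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = ((X[:, 0] + 0.3 * X[:, 1]) > 0).astype(int)       # the unknown concept c(x)

def weak_learner(X, y, D):
    # best threshold stump on a single coordinate under distribution D
    best, best_err = None, 1.0
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sgn in (1, -1):
                pred = ((sgn * (X[:, j] - thr)) > 0).astype(int)
                err = D @ (pred != y)
                if err < best_err:
                    best, best_err = (j, thr, sgn), err
    j, thr, sgn = best
    return lambda Z: ((sgn * (Z[:, j] - thr)) > 0).astype(int)

T, eps = 40, 0.2
w = np.ones(len(X))
votes = np.zeros(len(X))
for t in range(T):
    D = w / w.sum()                                   # distribution D_t over examples
    h = weak_learner(X, y, D)
    correct = (h(X) == y)
    w *= (1 - eps * correct)                          # penalty 1 on correctly classified points
    votes += h(X)

final = (votes > T / 2).astype(int)                   # majority vote h_final
print("training error:", np.mean(final != y))
```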

Analysis:

Let S be the set of training samples labeled incorrectly by the final hypothesis.

For each x ∈ S, since h_final(x) is a (wrong) majority vote, at least half of the h_t misclassify x, so

∑_t M(x, h_t) ≤ T/2. (98)

Let the potential function be Φ_t = ∑_x w_x(t).

Then

Φ_T ≤ Φ_1 e^{−α ∑_t M(D_t, h_t)} ≤ n e^{−αT(1/2 + γ)}, (99)
Φ_T ≥ ∑_{x ∈ S} w_x ≥ ∑_{x ∈ S} (1 − α)^{∑_t M(x, h_t)} ≥ |S| (1 − α)^{T/2}. (100)

Comparing the two bounds, T = (2/γ²) log(1/ϵ) rounds suffice, where ϵ = |S|/n is the training error.

8 Spectral Method

8.1 Sparsest cut

Definition 8.1 (Edge expansion).

Let G = ( V , E ) be an undirected graph, and S V be a subset of vertices. The edge expansion of S is

ϕ(S) = |E(S, V∖S)| / d(S), (101)

where d(S) = ∑_{v ∈ S} d(v) is the total degree of the vertices in S.

The edge expansion of a cut ( S , V S ) is max { ϕ ( S ) , ϕ ( V S ) } .

The edge expansion of a graph is

ϕ(G) = min_{S : 0 < |S| < |V|} ϕ(S, V∖S) = min_{S : d(S) ≤ d(V)/2, S ≠ ∅} ϕ(S). (102)

Sparsest cut: Compute ϕ ( G ) .

Example.

  • Cycle: 2 / n

  • Clique: 1 / 2 .

  • Barbell: 1 / n 2

Let A be the adjacency matrix of G . Suppose G is a d -regular graph. The normalized Laplacian is defined as

L̄ = I − (1/d)A. (103)

We compute the eigenvalues λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_n of L̄; then the Cheeger inequality states that

λ_2/2 ≤ ϕ(G) ≤ √(2λ_2). (104)
Lemma 6.

Let M ∈ ℝ^{n×n} be a symmetric matrix with eigenvalues λ_1 ≤ ⋯ ≤ λ_n. Then

λ_k = min_{V : dim V = k} max_{x ∈ V∖{0}} x^T M x / x^T x. (105)

Fact. If x_1 is an eigenvector of λ_1, then

λ_2 = min_{x ≠ 0, x ⊥ x_1} x^T M x / x^T x. (106)

Note that the Laplacian of a d-regular graph is L = dI − A, and

x^T L x = ∑_{(u,v) ∈ E} (x_u − x_v)². (107)

Let L ¯ = L / d .

Theorem 8.2.

Let G be a d-regular undirected graph and let λ_1 ≤ ⋯ ≤ λ_n be the eigenvalues of L̄. Then

  • λ_1 = 0 and λ_n ≤ 2.

  • λ k = 0 if and only if G has at least k connected components.

  • λ n = 2 if and only if at least one of the connected components of G is bipartite.

Proof.

i. Trivial.

ii. λ_k = min_{S : dim S = k} max_{x ∈ S∖{0}} ∑_{(u,v) ∈ E}(x_u − x_v)² / (∑_v d x_v²).

If λ_k = 0, there exists a k-dimensional subspace S such that every x ∈ S satisfies x_u = x_v for all (u, v) ∈ E. This means x must be constant within each connected component, so the dimension of S can be at most the number of connected components. The reverse direction is proved by letting S be the space of vectors that are constant within each connected component.

iii. λ_n = max_{x ≠ 0} x^T L̄ x / x^T x. Note that

2x^T x − x^T L̄ x = (1/d) ∑_{(u,v) ∈ E} (x_u + x_v)². (108)

Therefore,

λ_n = 2 − min_{x ≠ 0} ∑_{(u,v) ∈ E}(x_u + x_v)² / (d ∑_v x_v²). (109)

If λ_n = 2, then there is a vector x ≠ 0 with ∑_{(u,v) ∈ E}(x_u + x_v)² = 0, so x_u = −x_v for all (u, v) ∈ E. Let A = {v : x_v < 0} and B = {u : x_u > 0}. Then A ∪ B must be disconnected from {v : x_v = 0}, and the subgraph induced on A ∪ B is bipartite (every edge goes between A and B). ∎

8.1.1 Generalize to general graphs

L = D − A, where D = diag(d_1, …, d_n); L̄ = I − D^{−1/2} A D^{−1/2}.

Then the quadratic form generalizes from

(1/d) ∑_{(u,v) ∈ E} (x_u − x_v)²  to  ∑_{(u,v) ∈ E} (x_u/√d_u − x_v/√d_v)². (110)

8.1.2 Proof of Cheeger’s theorem

We want to prove λ_2/2 ≤ ϕ(G) ≤ √(2λ_2).

The easy part: λ_2/2 ≤ ϕ(G).

Note that λ_2 … (to fill in).

The hard part: ϕ(G) ≤ √(2λ_2).

Lemma 7.

Given any y ∈ ℝ^V, there exists an algorithm to find S ⊆ supp(y) with

ϕ(S) ≤ √(2 y^T L y / (y^T D y)). (111)
Proof.

We can solve min_{D^{1/2} y ⊥ v_1} y^T L y / (y^T D y) efficiently (the minimum is λ_2, attained by an eigenvector); assume y ∈ [−1, 1]^V. The problem is how to round the solution.

Algorithm: sweep cut. Choose a threshold t ∈ [0, 1] at random and output S_t = {i ∈ V : y_i² > t}. This procedure can be derandomized by simply enumerating all n − 1 possible sweep cuts.

𝔼_t|E(S_t, S̄_t)| = ∑_{(i,j) ∈ E} Pr[(i,j) cut by S_t] (112)
= ∑_{(i,j) ∈ E} |y_i − y_j|(y_i + y_j) (113)
≤ √(∑_{(i,j) ∈ E}(y_i − y_j)²) · √(∑_{(i,j) ∈ E}(y_i + y_j)²) (114)
≤ √(∑_{(i,j) ∈ E}(y_i − y_j)²) · √(2 ∑_{i ∈ V} d(i) y_i²) (115)

The first factor is just the square root of the numerator. Note that 𝔼_t[d(S_t)] = ∑_{i ∈ V} d(i) 𝔼_t[𝟙(i ∈ S_t)] = ∑_{i ∈ V} d(i) y_i².

Therefore,

𝔼_t[|E(S_t, S̄_t)|] / 𝔼_t[d(S_t)] ≤ √(2 y^T L y / (y^T D y)). (116)

This means

𝔼_t[ |E(S_t, S̄_t)| − √(2 y^T L y / (y^T D y)) · d(S_t) ] ≤ 0. (117)

It means there exists a t * such that

|E(S_{t*}, S̄_{t*})| / d(S_{t*}) ≤ √(2 y^T L y / (y^T D y)). (118)

Now we only need to construct z from y such that R(z) ≤ R(y) and, at the same time,

∑_{i : z_i ≠ 0} d(i) ≤ d(V)/2. (119)

Since D^{1/2} y ⊥ v_1, we know ∑_i d(i) y_i = 0. Suppose WLOG that y_1 ≤ y_2 ≤ ⋯ ≤ y_n. Find the smallest j such that

∑_{1 ≤ i ≤ j} d(i) ≥ d(V)/2, (120)

and let z = y − y_j·𝟙. Let z⁺ = max(0, z) and z⁻ = min(0, z). Then we know

∑_{i : z_i⁻ ≠ 0} d(i) ≤ d(V)/2 and ∑_{i : z_i⁺ ≠ 0} d(i) ≤ d(V)/2. (121)
Lemma 8.

min{R(z⁻), R(z⁺)} ≤ R(z) ≤ R(y).

Proof.

First, R(z) ≤ R(y): the numerators agree since (z_i − z_j)² = (y_i − y_j)², while the denominators satisfy ∑_i d(i) z_i² = ∑_i d(i) y_i² + ∑_i d(i) y_j² − 2 y_j ∑_i y_i d(i) = ∑_i d(i) y_i² + ∑_i d(i) y_j² ≥ ∑_i d(i) y_i², using that ∑_i y_i d(i) = 0, which follows from D^{1/2} y ⊥ v_1.

Moreover,

z^T D z = (z⁺)^T D z⁺ + (z⁻)^T D z⁻ (122)

and

z^T L z ≥ (z⁺)^T L z⁺ + (z⁻)^T L z⁻ (123)

Then

((z⁺)^T L z⁺ + (z⁻)^T L z⁻) / ((z⁺)^T D z⁺ + (z⁻)^T D z⁻) ≤ R(z), (124)

which means at least one of R(z⁻) and R(z⁺) is at most R(z). ∎

Finally, we conclude that ϕ(G) ≤ √(2λ_2), since

ϕ(G) = min_{S : d(S) ≤ d(V)/2, S ≠ ∅} ϕ(S). (125)

8.2 Spectral Clustering

Problem: input a set of points a i with a measure of similarity w i j 0 . How do we cluster them?

  • Embed into ℝ^d using eigenvectors of L̄. Suppose v_1, …, v_{d+1} are the eigenvectors of the d + 1 smallest eigenvalues. We map a_i to ((v_2)_i, …, (v_{d+1})_i).

  • When d = 2 , it is equivalent to our sweep cut algorithm.
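A small sketch of the spectral embedding above: build the normalized Laplacian from a similarity matrix W, take the eigenvectors of the smallest eigenvalues, and cluster the embedded points (k-means here via scipy). The Gaussian similarity and the synthetic data are illustrative choices.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
W = np.exp(-np.sum((pts[:, None] - pts[None]) ** 2, axis=-1))    # similarity w_ij
np.fill_diagonal(W, 0)

d = W.sum(axis=1)
L_bar = np.eye(len(W)) - W / np.sqrt(np.outer(d, d))             # I - D^{-1/2} W D^{-1/2}
vals, vecs = np.linalg.eigh(L_bar)
embedding = vecs[:, 1:3]                                         # skip v_1, keep the next 2 coords

_, labels = kmeans2(embedding, 2, minit='++', seed=0)
print(labels)                                                    # two clear clusters
```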

9 Random Walk

2-SAT problem.

Start with a random assignment. At each step, choose a clause that is not satisfied and flip one of its two variables, chosen at random.

Analysis.

Consider the Hamming distance r ∈ [0, n] between the current assignment and a satisfying assignment (n is the number of variables). This distance performs a random walk that moves toward 0 with probability at least 1/2 at each step, so the expected hitting time of r = 0 is O(n²).

9.1 Random walk on graphs

Let G = ( V , E ) be an undirected graph.

At each time, a random walk is at some node i V . At time t + 1 , the random walk chooses a neighbor of i at random and moves to that neighbor.

A “lazy” random walk: stays at i with probability 1 / 2 .

Stationary distribution.

  • Does there exist one?

  • If exists, how long does it take to converge? (mixing time)

Let p ( t ) be the probability distribution at time t .

p_{t+1}(i) = ∑_{j : (i,j) ∈ E} p_t(j) · 1/d(j). (126)

We can write this as a matrix multiplication:

p t + 1 = A D 1 p t , (127)

where D = diag ( d 1 , , d n ) , and A is the adjacency matrix. A D 1 is called the transition matrix.

Definition 9.1 (Stationary distribution).

A probability distribution π over V is a stationary distribution if

π = ( A D 1 ) π . (128)

We can directly write out a stationary distribution

π = (d_1, …, d_n)/(2m), where m = |E|. (129)
Definition 9.2 (ergodic).

A random walk is called ergodic if there is a distribution π , such that for any initial distribution p 0 ,

lim_{t→∞} p_t = π. (130)
Theorem 9.3.

A random walk is ergodic if and only if it is connected and not bipartite.

Lazy random walk:

p_{t+1}(i) = (1/2) p_t(i) + (1/2) ∑_{(i,j) ∈ E} p_t(j)/d(j), (131)

and in matrix forms:

p t + 1 = ( 1 2 I + 1 2 A D 1 ) p t . (132)
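A quick numerical check (example graph chosen arbitrarily) that the lazy random walk converges to the stationary distribution π = d/(2m).

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
d = A.sum(axis=1)
W = 0.5 * np.eye(4) + 0.5 * A / d        # lazy walk matrix (1/2)I + (1/2) A D^{-1}

p = np.array([1.0, 0, 0, 0])             # start at vertex 0
for _ in range(100):
    p = W @ p
pi = d / d.sum()                         # stationary distribution d / (2m)
print(np.abs(p - pi).max())              # ~0: the walk has mixed
```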
Theorem 9.4.

A lazy random walk is ergodic if and only if G is connected.

Proof of theorem 9.3 and 9.4 for d -regular graphs..

Let α_1 ≥ ⋯ ≥ α_n be the eigenvalues of Ā = AD^{−1}, and x_1, …, x_n the corresponding eigenvectors.

From theorem 8.2,

  • α_1 = 1, x_1 = 𝟏/n.

  • α_2 < 1 if G is connected.

  • α_n ≥ −1, and G is not bipartite if and only if α_n > −1.

Let p_0 = ∑_{i=1}^n c_i x_i, where c_i = ⟨x_i, p_0⟩. Then

p_t = ∑_{i=1}^n c_i α_i^t x_i. (133)

If G is connected and not bipartite, then |α_i| < 1 for all i ≠ 1. Thus α_i^t → 0 as t → ∞, and

lim_{t→∞} ∑_{i=1}^n c_i α_i^t x_i = c_1 x_1 = 𝟏/n. (134)

For the lazy random walk, the corresponding eigenvalues are w_i = (1/2)(1 + α_i) ∈ [0, 1) for i ≠ 1 when G is connected, so w_i^t → 0 as t → ∞. ∎

9.2 Mixing time

Definition 9.5 (Mixing time).

The smallest time t such that D_TV(p_t, π) ≤ 1/4, where π is the stationary distribution.

Definition 9.6 (Spectral gap).

We define the spectral gap of a graph by

α = min{1 − α_2, 1 − |α_n|}. (135)

For random walks on regular graphs,

p_t = ∑_{i=1}^n c_i α_i^t x_i = 𝟏/n + ∑_{i=2}^n c_i α_i^t x_i. (136)

Then

D_TV(p_t, π) = ‖∑_{i=2}^n c_i α_i^t x_i‖_1 (137)
≤ √n ‖∑_{i=2}^n c_i α_i^t x_i‖_2 (138)
≤ √n (1 − α)^t (∑_{i=2}^n c_i²)^{1/2} ≤ √n (1 − α)^t. (139)

10 Expander Graph

Definition 10.1.

A two-sided spectral expander ( d , γ ) is a graph, such that

  • G is d-regular

  • ∀i ≥ 2, |λ_i(L̄) − 1| ≤ γ.

  • Explicit: If it takes p o l y ( n ) time to produce A .

  • Strongly explicit: if a given neighbor of a given vertex can be produced in time polynomial in log n.

Remark. For complete graphs, λ 2 ( L ¯ ) = 1 / 2 . For a path, λ 2 ( L ¯ ) = 1 / n 2 .

Proposition 10.2 (Expanders are reliable networks.).

For expander graphs, if k edges are removed, the largest connected component has size at least n − k/(dϕ).

10.1 Random walks on expander graphs

On a complete graph, it takes O(k log n) bits to represent a k-step random walk. On a d-regular expander, it takes log n + k log d = log n + O(k) bits.

10.2 Application: Pseudo-random number generation

Suppose we are given a randomized algorithm with a constant error rate. We can decrease its error rate to exponentially small by repeating it polynomially many times.

Using the Chernoff bound, repeating t times (with a majority vote) brings the error down to C^{−t} for some constant C > 1. Suppose the algorithm requires r random bits; independent repetition then requires rt random bits. However, with expanders, we reduce the number of random bits to r + 10t.

Algorithm.

  • Choose v_1 ∈ {0, 1}^r at random.

  • Take t − 1 steps of a random walk on an expander with vertex set V = {0, 1}^r to generate v_2, …, v_t.

Note that a strongly explicit graph is required, as the number of vertices is exponentially large.

Take an expander graph on V with d = 400 and, for all i ≥ 2, |μ_i|/d ≤ 1/10 (the μ_i are the adjacency-matrix eigenvalues).

Let X ⊆ {0, 1}^r be the set of random strings on which the algorithm is incorrect; |X| ≤ 2^r/100. Let S = {i : v_i ∈ X} and Y = V ∖ X.

p_0 = 𝟏/n. Let χ_X and χ_Y be the characteristic vectors of X and Y.

Let D_X = diag(χ_X) and D_Y = diag(χ_Y). W = A/d is the transition matrix. The probability that a vertex chosen according to p lies in X is

χ_X^T p = 𝟏^T D_X p. (140)

q = W D_X p_0 is the (sub)distribution of a walk that starts at a vertex in X and takes one step. Let R be an arbitrary subset of the times {0, 1, …, t}, and let Z_i = X if i ∈ R and Z_i = Y otherwise. The probability that the walk is in X at every time i ∈ R is the quantity below, and we want to show

𝟏^T D_{Z_t} W D_{Z_{t−1}} W ⋯ D_{Z_1} W D_{Z_0} p_0 ≤ (1/5)^{|R|}. (141)

Lemma: ‖D_X W‖ ≤ 1/5.

Then

𝟏^T D_{Z_t} W D_{Z_{t−1}} W ⋯ D_{Z_1} W D_{Z_0} p_0 ≤ ‖𝟏‖ · ∏_{i ∈ R} ‖D_X W‖ · ‖p_0‖ ≤ √n · (1/5)^{|R|} · (1/√n) = (1/5)^{|R|}, (142)

using ‖D_Y W‖ ≤ 1 for the remaining factors.

Then,

Pr[|S| > t/2] ≤ ∑_{R : |R| > t/2} Pr[walk is in X at all times in R] ≤ ∑_{R : |R| > t/2} (1/5)^{|R|} ≤ (2/√5)^{t+1}. (143)

For the lemma: write v = c_1 𝟏 + y with 𝟏^T y = 0. Then ‖v‖ ≥ max{c_1 √n, ‖y‖}. Since W𝟏 = 𝟏 and ‖Wy‖ ≤ (1/10)‖y‖,

‖D_X W v‖ ≤ c_1 ‖D_X 𝟏‖ + ‖D_X W y‖ ≤ c_1 √|X| + (1/10)‖y‖ ≤ (1/10) c_1 √n + (1/10)‖y‖ ≤ (1/5)‖v‖. (144)

11 Hardness Assumption

11.1 Encryption

Alice sends a message b to Bob. Bob should be the only person who can decode the message.

Private-key encryption

  • Shared key: random bit c { 0 , 1 } .

  • To send a message b, Alice sends b ⊕ c (one-time pad).

Public-Key encryption

  • Key generation: Generate a public key and a secret key. Publish p k .

  • Encryption: take randomness r , E n c ( b , p k , r ) = c

  • Decryption: D e c ( p k , s k , c ) = b .

Definition 11.1 (Computationally indistinguishable).

We say (pk, Enc(pk, r, 0)) and (pk, Enc(pk, r, 1)) are computationally indistinguishable if the following holds:

if there exists an efficient algorithm that distinguishes Enc(0) from Enc(1), then some underlying hardness assumption is broken.

11.2 Hardness

Definition 11.2 (Learning with errors).

The LWE problems are defined by

  • A ∈ ℤ_q^{m×n} with m ≫ n log q.

  • A is sampled uniformly at random,

  • Consider s ∈ ℤ_q^n. Given As (with no error), solving for s is easy (Gaussian elimination).

  • Given As + e, where e ∈ ℤ_q^m is a small error term (‖e‖ = O(√n)), recovering s is believed to be hard.

Assumption 11.3 (Decisional LWE assumption).

Let e be sampled from a truncated Gaussian with |e_i| ≤ σ, where σ is on the order of √n. Then

(A, As + e) ≈ (A, Unif(ℤ_q^m)). (145)

11.3 Encryption scheme from LWE.

Private-key scheme:

  • Shared key s ∈ ℤ_q^n.

  • To encrypt a message b , pick A at random and pick e .

    c = As + e + (0, …, 0, b·⌊q/2⌋). (146)
  • Send ( A , c ) .

  • To decrypt, compute c − As and round the last coordinate to 0 or ⌊q/2⌋ to recover b.

Public-key scheme:

The only difference is that we treat A , A s + e as the public-key, and treat s as the secret key.

  • Randomness r ∈ {0, 1}^m. Encrypt: (r^T A, r^T(As + e) + b·⌊q/2⌋).

  • Decrypt (c_1, c_2): round c_2 − c_1^T s to recover b.
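A toy numerical sketch of the LWE-based public-key scheme above. The parameters are tiny and completely insecure; they are chosen only so the rounding step is visibly correct.

```python
import numpy as np

rng = np.random.default_rng(0)
q, n, m = 3329, 16, 64
s = rng.integers(0, q, n)                              # secret key
A = rng.integers(0, q, (m, n))
e = rng.integers(-3, 4, m)                             # small truncated error
pk = (A, (A @ s + e) % q)                              # public key (A, As + e)

def encrypt(pk, b):
    A, t = pk
    r = rng.integers(0, 2, m)                          # randomness r in {0,1}^m
    return (r @ A % q, (r @ t + b * (q // 2)) % q)     # (r^T A, r^T(As+e) + b*floor(q/2))

def decrypt(sk, c1, c2):
    v = (c2 - c1 @ sk) % q                             # = r^T e + b*floor(q/2): small noise around 0 or q/2
    return int(q // 4 < v < 3 * q // 4)

c1, c2 = encrypt(pk, 1)
print(decrypt(s, c1, c2), decrypt(s, *encrypt(pk, 0)))   # 1 0
```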

Security of the Public-key scheme.

By the LWE assumption we may replace As + e with a uniform u; then r^T[A | u] + (0, …, 0, b·⌊q/2⌋) is statistically close to uniform by the following lemma.

Lemma 9 (Leftover Hash).

For m ≫ n log q and r ∈ {0, 1}^m uniformly at random, the following two distributions are statistically close:

(A, r^T A) and (A, u). (147)

11.4 Worst-case to average-case reduction

Start with a distribution D over s such that (A, As + e) is hard for s ∼ D.

Sample ŝ uniformly at random; then (A, As + e + Aŝ) = (A, A(s + ŝ) + e) is an LWE instance whose secret s + ŝ is uniformly distributed, so hardness for D implies hardness for a uniformly random secret.

12 Quantum Computing

Consider the qubit basis states |0⟩ = (1, 0)^T and |1⟩ = (0, 1)^T.

Definition 12.1 (Quantum circuits.).

A quantum circuit is defined by a unitary matrix U (U†U = I), so that

⟨ψ|U†U|ψ⟩ = ⟨ψ|ψ⟩ = 1. (148)
Definition 12.2 (Universal quantum gates).
  • Toffoli: T|a, b, c⟩ = |a, b, (a ∧ b) ⊕ c⟩.

  • Hadamard: H|0⟩ = (1/√2)(|0⟩ + |1⟩) and H|1⟩ = (1/√2)(|0⟩ − |1⟩).

One can check that both gates are unitary.

12.1 Efficient period finding

Simon’s problem. We are given a black-box function f : {0, 1}^n → {0, 1}^n with the promise that there exists s ∈ {0, 1}^n such that f(x) = f(x ⊕ s) for all x. How do we find s?

Classical algorithms take exponential time.

Quantum speedup.

  • Create a uniform superposition (1/√(2^n)) ∑_x |x⟩ by applying H^{⊗n} to |0^n⟩.

  • Apply f in superposition:

    (1/√(2^n)) ∑_x |x⟩|0^n⟩ ↦ (1/√(2^n)) ∑_x |x⟩|f(x)⟩. (149)

    Note that this operation is unitary (reversible) as we can subtract f ( x ) to recover.

  • Measure the second register. If we observe y ∈ {0, 1}^n (each attainable value has probability 2/2^n), the remaining state is

    (1/√2)(|x⟩ + |x ⊕ s⟩), where f(x) = f(x ⊕ s) = y. (150)
  • Apply H^{⊗n}:

    (1/√(2^{n+1})) ∑_y (−1)^{x·y} (1 + (−1)^{y·s}) |y⟩. (151)

    Notice that 1 + (−1)^{y·s} = 0 if y·s = 1.

  • Sample many such y to obtain linear equations y·s = 0 (mod 2), then solve for s.

12.2 Reduction of factorization to period finding

Factoring: N = p q , find p , q .

  • Pick a < N with gcd(a, N) = 1.

  • Construct the function f(x) = a^x (mod N). Then f(x) = f(x + r), where r is the multiplicative order of a, i.e., the smallest r > 0 with a^r ≡ 1 (mod N). We can thus find r by period finding (see the sketch after this list).

    Lemma 10.

If N = pq and a < N is chosen uniformly at random with gcd(a, N) = 1, then with probability at least 1/2, r is even and a^{r/2} ≢ ±1 (mod N).

  • (a^{r/2} + 1)(a^{r/2} − 1) ≡ 0 (mod N), so gcd(N, a^{r/2} + 1) is a nontrivial factor of N.
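A sketch of the classical post-processing above. Here the order r is found by brute force instead of the quantum period-finding subroutine, just to make the reduction concrete.

```python
from math import gcd

def factor_from_order(N, a):
    r = 1
    while pow(a, r, N) != 1:               # order finding (the quantum step in Shor's algorithm)
        r += 1
    if r % 2 == 1 or pow(a, r // 2, N) == N - 1:
        return None                        # bad a: retry with another a
    return gcd(N, pow(a, r // 2, N) + 1)

N = 15
for a in (2, 4, 7, 8, 11, 13):
    if gcd(a, N) == 1:
        f = factor_from_order(N, a)
        if f and f not in (1, N):
            print(a, f)                     # e.g. a = 2 gives the factor 5
            break
```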