Steven Soojin Kim

Thinning PRMs

2016-03-06T00:00:00-05:00

Let $X$ be a Poisson random variable with rate $\lambda$, and let $Y$ be an independent Poisson random variable with rate $\mu$. Then, simple calculations show that $X+Y$ is a Poisson random variable with rate $\lambda + \mu$. This is known as the superposition property of the Poisson distribution. In the opposite direction, suppose $X$ is a Poisson random variable with rate $\lambda$, and let $Z_1,Z_2,\dots$ be i.i.d. Bernoulli random variables with success probability $p$, independent of $X$. Then, $\sum_{i=1}^\infty Z_i \mathbb{I}_{\{i \le X\}}$ is Poisson distributed with rate $\lambda p$. This is known as the thinning property. It turns out that this latter property is far more general, and the goal of this post is to illustrate thinning for general Poisson random measures.
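Both properties are easy to check by simulation. The following is a minimal sketch using only the Python standard library; the helper names, the parameter values, and the exponential-gap sampler are illustrative choices and not part of the original post.

```python
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) sample by counting unit-rate exponential arrivals in [0, rate]."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed <= rate:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def thin(count, p, rng):
    """Keep each of `count` points independently with probability p."""
    return sum(1 for _ in range(count) if rng.random() < p)


def mean_and_variance(samples):
    """Empirical mean and (biased) empirical variance of a list of numbers."""
    m = sum(samples) / len(samples)
    v = sum((s - m) ** 2 for s in samples) / len(samples)
    return m, v


rng = random.Random(0)
lam, mu, p, trials = 3.0, 2.0, 0.4, 20000  # illustrative values

superposed = [sample_poisson(lam, rng) + sample_poisson(mu, rng) for _ in range(trials)]
thinned = [thin(sample_poisson(lam, rng), p, rng) for _ in range(trials)]

# For a Poisson variable, the mean and the variance both equal the rate.
print("superposition:", mean_and_variance(superposed), "target", lam + mu)
print("thinning:     ", mean_and_variance(thinned), "target", lam * p)
```

Matching the mean and variance to the target rate is only a necessary check, not a proof that the samples are Poisson, but it is a quick way to see both properties at work.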

For a given $\sigma$-finite measure space $(E,\mathcal{E},\mu)$, the Poisson random measure (PRM) with intensity $\mu$ is a random measure $\omega\mapsto N_\omega(\cdot)$ defined on some probability space $(\Omega,\mathcal{F},\mathbb{P})$ such that:

for all $\omega\in\Omega$, $N_\omega(\cdot)$ is a measure on $(E,\mathcal{E})$;
for all $A \in \mathcal{E}$, the random variable $\omega\mapsto N_\omega(A)$ is Poisson distributed with rate $\mu(A)$;
if $A_1,\dots, A_k \in \mathcal{E}$ do not intersect, then $N(A_1),\dots,N(A_k)$ are mutually independent.

(In the remainder of this post, I’ll omit $\omega$).

Example. Suppose $E=\mathbb{R}_+$, let $\mathcal{E}$ be the Borel $\sigma$-algebra, and $\mu(dx) = \lambda \,dx$, where $dx$ is the Lebesgue measure. Then, the process $t \mapsto N([0,t])$ is a Poisson process with rate $\lambda$.

More generally, (integrals with respect to) Poisson random measures offer a convenient way to describe and represent stochastic processes with jumps. Compare to the setting of processes with continuous paths, where the Ito integral plays the corresponding role.

One approach to proving large deviations results for stochastic processes is to characterize certain changes of measure as “control” problems for random paths. In the continuous setting, the Girsanov theorem indicates that such changes of measure are achieved by imposing drift via adapted processes. In the jump setting, such “control” is achieved through a generalization of the “thinning” mechanism described above.

Instead of getting into details, I’d like to just visualize this thinning mechanism in a very simple case: with constant “control”. Consider the plot below, which displays the outcome of a Poisson random measure $N$ on $[0,40] \times [0,1]$, with intensity the Lebesgue measure. The blue and grey points together represent the outcome of the Poisson random measure $N$. The blue dots alone represent a thinning of $N$; that is, the outcome of the PRM $\mathbb{I}_{[0,c]}(x) N(dt\, dx)$, where in the plot below, I have chosen $c=0.65$. In the second plot are the associated homogeneous Poisson processes, again with grey representing the total process $t\mapsto N([0,t]\times [0,1])$ (with rate 1) and blue representing the thinned process $t\mapsto N([0,t]\times [0,c])$ (with rate 0.65). In particular, every jump of the blue path (respectively, grey path) corresponds to a blue point (respectively, a blue or grey point). Lastly, the dashed lines represent the “average” behavior of the associated Poisson processes.
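For readers who want to reproduce the mechanism without the figure, here is a small standard-library sketch of the same construction. It is not the code from my repo: the function names and the seed are made up for illustration, and it only prints counts instead of plotting.

```python
import random


def sample_prm(t_max, rng):
    """Sample a PRM on [0, t_max] x [0, 1] with Lebesgue intensity.

    Time coordinates come from unit-rate exponential gaps (so the number of
    points is Poisson(t_max)), and each point gets an independent Uniform[0, 1]
    second coordinate.
    """
    points, t = [], rng.expovariate(1.0)
    while t <= t_max:
        points.append((t, rng.random()))
        t += rng.expovariate(1.0)
    return points


def counting_process(points, t):
    """Number of points whose time coordinate lies in [0, t]."""
    return sum(1 for (s, _) in points if s <= t)


rng = random.Random(1)
t_max, c = 40.0, 0.65

all_points = sample_prm(t_max, rng)                          # grey and blue points together
kept_points = [(t, x) for (t, x) in all_points if x <= c]    # blue points only

for t in (10.0, 20.0, 40.0):
    total = counting_process(all_points, t)
    kept = counting_process(kept_points, t)
    print(f"t={t:4.0f}: total={total} (mean {t:.1f}), thinned={kept} (mean {c * t:.1f})")
```

Keeping a point exactly when its second coordinate is at most $c$ is the constant-control thinning described above; a time-dependent control would replace the constant `c` by a function of `t`.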

For reference, the plot was produced in Python with matplotlib, and then ported to the web with mpld3. To zoom in and pan, use the icons in the lower left of the figure. The code can be found in my GitHub repo.

Thinning PRMs was originally published by Steven Soojin Kim at Steven Soojin Kim on 2016.03.06.

Control theory today

2015-10-10T00:00:00-04:00

Optimal control theory is a rich mathematical field with a surprisingly interesting history. It dates back to the brachistochrone problem of Johann Bernoulli in 1696, but it genuinely boomed during the Cold War through independent developments in the Soviet Union (Steklov Institute) and the United States (RAND Corporation). But at some point, I came across Yu-Chi Ho’s blog post, from 2010, where he reports the bold pronouncement of an NSF program director: “Control is dead!”

Professor Ho explains that perhaps “mature” is a better word, but even this might be seen as a strong claim. As a graduate student, I am in no place to judge the “life” or “death” of such a broad field, but my bias towards my home department compels me to promote and celebrate control theory, which played an important role in the growth of the Division of Applied Mathematics and the development of the associated Lefschetz Center for Dynamical Systems. That is, I would like to believe that this significant part of Brown’s history still plays a serious role in the mathematical community today.

The goal of this post is to point out some modern appearances of control theory, particularly in seemingly unexpected areas (at least, unexpected to this amateur author). Of course, control theory has long played a leading role in applied probability: e.g., in financial engineering, queueing networks, and filtering. But these are somewhat well-recognized as the usual stomping grounds of control theory, and I would like to highlight some other (possibly more surprising) connections.

In particular, the first three applications described below invoke the following variational formula from Boué, Dupuis (AoP’98). For $W$ a standard $d$-dimensional Brownian motion on $[0,1]$, and $f: C([0,1];\mathbb{R}^d) \rightarrow \mathbb{R}$ measurable and bounded from above, we have

\[-\log \mathbb{E} e^{-f(W)} = \inf_{u} \,\, \mathbb{E}\left[ \frac{1}{2}\int_0^1 |u_s|^2 ds + f\left(W + \int_0^\cdot u_s\,ds\right) \right], \tag{$\star$}\]

where the infimum is over the space of controls $u$ which are progressively measurable with respect to the augmented Brownian filtration. One should view the first term in the infimum as a “running cost” for the effort exerted by the control $u$, and the second term as a “state occupation cost”. In particular, if $f$ only depends on the time 1 state of the controlled input process, then it can be interpreted as the usual “terminal cost”. Under the preceding interpretations, $-\log\mathbb{E} e^{-f(W)}$ is a representation for the value function of the associated stochastic control problem. In fact, formulas like $(\star)$ arose even earlier in the control literature; e.g., in Fleming (AMO’77).
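As a quick sanity check of $(\star)$, here is a formal computation in $d=1$ with the linear functional $f(w) = \theta w(1)$. This $f$ is not bounded from above, so the formula is being used outside the hypotheses stated above; the choice of $f$ and the restriction to constant controls $u \equiv a$ are assumptions made purely for illustration.

\[\begin{align*} -\log \mathbb{E} e^{-\theta W_1} &= -\frac{\theta^2}{2}, \\ \inf_{a \in \mathbb{R}} \mathbb{E}\left[ \frac{1}{2}\int_0^1 a^2\, ds + \theta\left(W_1 + a\right) \right] &= \inf_{a\in\mathbb{R}} \left\{ \frac{a^2}{2} + \theta a \right\} = -\frac{\theta^2}{2}. \end{align*}\]

So the constant control $a = -\theta$ already attains the left-hand side: it pays a running cost of $\theta^2/2$ in order to shift the terminal state by $-\theta$.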

As promised, here are a few “modern” links to control theory:

Functional inequalities: Lehec (AIHP’13) derives what is essentially the dual formulation of $(\star)$: for $\gamma$ the Wiener measure on $C([0,1];\mathbb{R}^d)$ and $H(\mu \| \gamma)$ the relative entropy of $\mu$ with respect to $\gamma$, $H( \mu \| \gamma ) = \min_{u} \mathbb{E}\left[\frac{1}{2}\int_0^1 |u_s|^2 ds\right],$ where the minimum is over all controls $u$ such that the process $W + \int_0^\cdot u_s ds$ has law $\mu$. That is, the control $u$ is related to the optimal change of measure from $\gamma$ to $\mu$. Related analysis of (an) optimizing $u$, combined with basic martingale arguments, yields straightforward proofs of Talagrand’s transportation cost inequality, log Sobolev inequality, and Brascamp-Lieb inequality (for the Wiener measure). Similar control-like principles (for the standard Gaussian measure on $\mathbb{R}^d$ instead of the Wiener measure on path space) are employed in Eldan, Lee (preprint’14) to establish uniform decay of the level sets of the Gaussian measure under the Ornstein-Uhlenbeck semigroup.
Spin glasses: In seminal work by Talagrand (AoM’06) and Panchenko (AoP’14), it was established that the thermodynamic limit of the free energy of the Sherrington-Kirkpatrick model (and associated mixed $p$-spin model) is given by a minimization problem involving the “Parisi functional”, the solution to a particular nonlinear PDE. Inspired by the variational formula $(\star)$, it is shown in Auffinger, Chen (CMP’14) that the Parisi functional is strictly convex, and thus a unique “Parisi measure” characterizes the limiting free energy of the SK model. The proof of strict convexity is simplified in Jagannath, Tobasco (PAMS’15), by explicitly appealing to the dynamic programming principle from stochastic control theory. This theme of a control-theoretic approach to analysis of the Parisi functional is continued in Chen (preprint’15).
KPZ and rough paths: In Section 7 of Gubinelli, Perkowski (preprint’15), the authors use a generalized version of $(\star)$ to frame the KPZ equation as the value function of a stochastic control problem. This representation is in turn used to prove certain a priori estimates which yield global existence of solutions to the KPZ equation, complementing Hairer’s approach via regularity structures.
First-passage percolation: Krishnan (preprint’14) views the first-passage time on $\mathbb{Z}^d$ as a discrete control problem, where the canonical basis vectors $\{\pm e_1, \cdots, \pm e_d\}$ act as the “controls” of a minimizing path between two points. This in turn yields a characterization of the associated time constant as the solution to a discrete Hamilton-Jacobi equation.
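To make the last item a bit more concrete, here is a rough sketch of the dynamic programming principle behind first-passage times, computed with Dijkstra's algorithm on a finite box in $\mathbb{Z}^2$. The finite box, the Uniform[0.5, 1.5] edge weights, and all function names are assumptions of this sketch; it does not reproduce the formulation in Krishnan's paper.

```python
import heapq
import random

STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # the "controls" +-e_1, +-e_2


def in_box(site, n):
    return max(abs(site[0]), abs(site[1])) <= n


def neighbors(x, n):
    """Sites reachable from x by applying one control, staying inside the box."""
    for e in STEPS:
        y = (x[0] + e[0], x[1] + e[1])
        if in_box(y, n):
            yield y


def random_edge_weights(n, rng):
    """I.i.d. Uniform[0.5, 1.5] passage times on the nearest-neighbor edges of the box."""
    weights = {}
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            for e in ((1, 0), (0, 1)):
                y = (i + e[0], j + e[1])
                if in_box(y, n):
                    weights[frozenset(((i, j), y))] = rng.uniform(0.5, 1.5)
    return weights


def passage_times(n, weights):
    """Dijkstra from the origin; returns {site: first-passage time T(site)}."""
    dist = {(0, 0): 0.0}
    heap = [(0.0, (0, 0))]
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist[x]:
            continue  # stale heap entry
        for y in neighbors(x, n):
            nd = d + weights[frozenset((x, y))]
            if nd < dist.get(y, float("inf")):
                dist[y] = nd
                heapq.heappush(heap, (nd, y))
    return dist


def bellman_residual(n, weights, dist):
    """Largest violation of T(x) = min_e [ tau(x, x+e) + T(x+e) ] over x != origin."""
    worst = 0.0
    for x, t in dist.items():
        if x == (0, 0):
            continue
        best = min(weights[frozenset((x, y))] + dist[y] for y in neighbors(x, n))
        worst = max(worst, abs(t - best))
    return worst


n = 10
weights = random_edge_weights(n, random.Random(4))
dist = passage_times(n, weights)
print("T(n e_1) =", dist[(n, 0)], " Bellman residual =", bellman_residual(n, weights, dist))
```

The residual check is the point: the passage time satisfies a one-step optimality equation over the controls, which is the discrete analog of a Hamilton-Jacobi equation.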

I suppose the overarching theme is that many mathematical problems are just (highly sophisticated) optimization problems which, with some work, can be massaged into control problems. In particular, the preceding examples show that adopting a control-theoretic perspective can lead to insightful, meaningful, and productive reformulations of existing problems!

Control theory today was originally published by Steven Soojin Kim at Steven Soojin Kim on 2015.10.10.

Applied Math Retreat

2015-09-22T00:00:00-04:00

This past weekend, 20 of the applied mathematics graduate students gathered in Franklin, NH for our first (annual?) department retreat, organized by Michael Snarski and myself. The goals of the retreat were to stimulate scientific interactions, introduce first-years to the mathematical energy of the department, and refresh our minds in a beautiful natural environment.

I personally had a great time! The fresh air lent the whole weekend a very positive atmosphere which seemed conducive to mathematical activity. The retreat was also a great way to catch up with what others are working on, and to get to know the new students in our department.

We spent the mornings doing math, with awesome workshops led by August Guang and Michael, and some short research talks given by Ivana Petrovic, Leroy Jia, Melissa McGuirl, Clark Bowman, and Alexandria Volkening. In the afternoons, we enjoyed the surroundings with some relaxing hikes and excursions on the lake.

Special shoutouts to: our department chair Björn Sandstede for his feedback, support, and encouragement; Gautam Kamath from MIT for the initial idea and some preliminary help with logistical aspects of the trip; Guo-Jhen Wu for a lot of behind-the-scenes help and early brainstorming; and all the participants for sharing their mathematical ideas, driving, cooking, cleaning, and basically making this weekend happen.

Here are a few pictures!

Applied Math Retreat was originally published by Steven Soojin Kim at Steven Soojin Kim on 2015.09.22.

Random Schrodinger Operators

2014-11-15T00:00:00-05:00

Consider the one-dimensional, discrete Schrodinger operator on $\ell^2(\mathbb{Z})$:

\[(Hu)(n) := u(n+1) + u(n-1) + V(n) u(n) \, , \quad n \in \mathbb{Z}.\]

This is a toy model for the behavior of an electron in a large domain. In order to model electron dynamics in the presence of a random disordered environment, we can place assumptions on the randomness of the potential $V$. From the point of view of matrices, $H$ is an infinite Jacobi matrix with random elements along the diagonal. While this model is very simplified, random Schrodinger operators and their spectra turn out to have many interesting mathematical properties, including the prediction/replication of physically observed phenomena – see Chapter 9 of Cycon, Froese, Kirsch, Simon for an introduction which is fairly gentle, aside from a few typos. I will focus on two nice results: the integrated density of states and the Thouless formula.

Integrated Density of States

Let $(\Omega,\mathcal{F},\mathbb{P})$ be the canonical probability space associated with $V$, and suppose the potential $V$ is stationary and ergodic. For a given outcome $\omega\in\Omega$, let $V_\omega$ denote the potential, and $H_\omega$ the associated operator.

By elementary manipulations and applications of the ergodic property, it is possible to show that $\sigma(H_\omega)$, the spectrum of $H_\omega$, equals a deterministic set, call it $\Sigma$, $\mathbb{P}$-a.s. However, it is interesting to ask how the spectrum is distributed along this set $\Sigma$. Let $\delta_0\in \ell^2(\mathbb{Z})$ be the unit vector with value 1 in the 0th coordinate, and 0 elsewhere. Define the measure $dk$ by

\[\int_\mathbb{R} f(\lambda) dk(\lambda) := \mathbb{E} [\langle \delta_0, f(H_\omega) \delta_0 \rangle ].\]

Theorem. The support of $dk$ is $\Sigma$.

From just this definition and the claim above, it is not clear why $dk$ is in any way related to the “distribution” of the spectrum. Note that the spectrum of $H_\omega$ is an infinite set, so it is not possible to naively bucket and histogram to describe the distribution… but it essentially is! That is, we will restrict $H_\omega$ to a finite interval $[-L,L]$, compute the empirical density of the eigenvalues in this interval, and then see what happens as we take $L\rightarrow\infty$.

Let $\{\mathcal{E}_\Delta(\omega)\}_{\Delta\subset\mathbb{R}}$ represent the family of spectral projections associated with $H_\omega$, let $\chi_L$ be the indicator function of $[-L,L]$, and define the measure $dk_L$ by

\[\int_A dk_L := \frac{1}{2L+1} \text{dim Range}(\chi_L \mathcal{E}_A(\omega) \chi_L) = \frac{1}{2L+1} \text{tr}(\mathcal{E}_A(\omega)\chi_L).\]

Theorem. As $L\rightarrow\infty$, $dk_L$ converges vaguely to $dk$, $\mathbb{P}$-a.s.

Idea of Proof. First, prove that for a given bounded measurable function $f$, we have $\int f dk_L \rightarrow \int f dk$, $\mathbb{P}$-a.s. To do so requires an application of Birkhoff’s ergodic theorem. Then, for each bounded measurable function $f$, we have a set $\Omega_f$ of measure 1 on which the desired behavior occurs. The conclusion of the proof is a classical approximation argument which exploits the separability of $C_0$ to stitch together countably many $\Omega_f$ to get a set of measure 1 on which the statement is true for all $f\in C_0$. $\square$.

Note that the prelimit measures $dk_L$ are random, so it is not a priori obvious (at least, to me) that $dk$ should be a deterministic measure! This offers a nice parallel to other results in random matrix theory, where taking the limit of empirical spectral distributions can give an unexpectedly explicit (and universal!) limiting measure – e.g., the semicircle law for Wigner matrices, or the circular law for matrices with iid elements.
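The “bucket and histogram” idea is easy to try numerically. Below is a minimal sketch that evaluates the cumulative version of $dk_L$ for the matrix truncation of $H_\omega$ to $[-L,L]$, a close relative of the compression $\chi_L \mathcal{E}_A(\omega)\chi_L$ used above but not the same object. It counts eigenvalues with the classical pivot-sign (Sturm sequence) trick instead of diagonalizing. The i.i.d. Uniform$[-1,1]$ potential, the function names, and the zero-pivot nudge are all assumptions of this sketch.

```python
import math
import random


def count_eigenvalues_below(potential, energy):
    """Number of eigenvalues of the truncated operator that are < energy.

    The truncation is the tridiagonal matrix with `potential` on the diagonal
    and ones on the off-diagonals. By Sylvester's law of inertia, the count
    equals the number of negative pivots in the LDL^T factorization of
    (H - energy * I).
    """
    negatives, pivot = 0, 1.0
    for i, v in enumerate(potential):
        pivot = (v - energy) - (0.0 if i == 0 else 1.0 / pivot)
        if pivot == 0.0:
            pivot = 1e-300  # nudge an exactly-zero pivot to avoid dividing by zero
        if pivot < 0:
            negatives += 1
    return negatives


def empirical_ids(potential, energy):
    """k_L(energy): fraction of eigenvalues of the truncation lying below energy."""
    return count_eigenvalues_below(potential, energy) / len(potential)


def free_eigenvalues(n):
    """Explicit eigenvalues 2 cos(j pi / (n + 1)) of the n x n truncation with V = 0."""
    return [2.0 * math.cos(j * math.pi / (n + 1)) for j in range(1, n + 1)]


rng = random.Random(2)
L = 500
potential = [rng.uniform(-1.0, 1.0) for _ in range(2 * L + 1)]  # i.i.d. disorder (assumed)

for energy in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"k_L({energy:+.1f}) ~ {empirical_ids(potential, energy):.3f}")
```

Rerunning with a different seed and a large $L$ should give nearly the same numbers, which is the self-averaging behavior that the theorem asserts.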

Thouless Formula

Part of the joy of the one-dimensional assumption is that solutions to the eigenvalue problem $(H-E)u=0$ can be written in terms of $2\times 2$ transfer matrices since any solution is determined by its value at two adjacent points in $\mathbb{Z}$. That is, let $\mathbf{u}(n) = (u(n+1), u(n))$, and define

\[A_n(E,\omega) := \begin{pmatrix} E - V_\omega(n) & -1 \\ 1 & 0 \end{pmatrix}\]

Then,

\[\begin{array}{c} u(n+1) + u(n-1) + (V_\omega(n) - E) u(n) = 0\\ \Updownarrow\\ \mathbf{u}(n+1) = A_{n+1}(E,\omega) \mathbf{u}(n). \end{array}\]

Note that $\mathbf{u}(n)$ can be written in terms of the product of the random matrices $A_n(E)$ applied to some initial condition. Then, Furstenberg’s theorem tells us that for $E\in\mathbb{R}$ and $\mathbb{P}$-a.s. $\omega\in\Omega$, there exists $\gamma(E)$ such that

\[\gamma(E) := \lim_{N\rightarrow \pm \infty} \frac{1}{|N|} \log \left\| \prod_{i=0}^N A_i(E,\omega) \right\|\]

Theorem. $\gamma(E) = \int \log |E - E'| dk(E')$.

To me, this relationship is incredible! Unfortunately, I don’t have much intuition as to why it should be true, but the proof involves showing that a similar result holds in the finite $N$ case, and then exploiting the subharmonicity of $\gamma$. Moreover, this is not merely a nice connection between $\gamma$ and $k$, but also a fundamental ingredient of the proof that under certain conditions on $V$, the spectrum $\sigma(H_\omega)$ has no absolutely continuous part.
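A numerical sanity check at least makes the formula plausible. The sketch below estimates $\gamma(E)$ from a renormalized product of transfer matrices and, in the free case $V \equiv 0$ where the eigenvalues of the finite truncation are explicit, compares it against the finite-$N$ version of $\int \log|E-E'|\,dk(E')$. The function names, the choice $E = 3$, and the uniform disorder in the last lines are assumptions made for illustration.

```python
import math
import random


def lyapunov_exponent(potential, energy):
    """Estimate gamma(E) as (1/N) log || A_N ... A_1 || with running renormalization."""
    # (a, b, c, d) holds the renormalized running product [[a, b], [c, d]].
    a, b, c, d = 1.0, 0.0, 0.0, 1.0
    log_norm = 0.0
    for v in potential:
        # Left-multiply by the transfer matrix [[E - v, -1], [1, 0]].
        a, b, c, d = (energy - v) * a - c, (energy - v) * b - d, a, b
        scale = math.sqrt(a * a + b * b + c * c + d * d)  # Frobenius norm
        a, b, c, d = a / scale, b / scale, c / scale, d / scale
        log_norm += math.log(scale)
    return log_norm / len(potential)


def thouless_sum_free(n, energy):
    """(1/n) sum_j log|E - E_j| with E_j = 2 cos(j pi / (n + 1)), the V = 0 eigenvalues."""
    return sum(math.log(abs(energy - 2.0 * math.cos(j * math.pi / (n + 1))))
               for j in range(1, n + 1)) / n


n, energy = 2000, 3.0
print("transfer matrices:", lyapunov_exponent([0.0] * n, energy))
print("Thouless sum:     ", thouless_sum_free(n, energy))
print("closed form:      ", math.acosh(energy / 2.0))  # free case, |E| > 2

rng = random.Random(5)
disorder = [rng.uniform(-1.0, 1.0) for _ in range(20000)]
print("disordered gamma(0.5):", lyapunov_exponent(disorder, 0.5))
```

In the free case with $|E|>2$, the larger eigenvalue of the constant transfer matrix is $e^{\text{arccosh}(|E|/2)}$, which is where the closed form comes from.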

Random Schrodinger Operators was originally published by Steven Soojin Kim at Steven Soojin Kim on 2014.11.15.

Weak solutions of SDE

2014-09-17T00:00:00-04:00

Background
Weak but not strong

Background

Let $b:[0,\infty)\times\mathbb{R}^d \rightarrow \mathbb{R}^d$ and $\sigma:[0,\infty)\times\mathbb{R}^d \rightarrow \mathbb{R}^{d\times r}$ be Borel-measurable functions. We would like to “solve” the following SDE,

\[dX_t = b(t,X_t) dt + \sigma(t, X_t) dW_t,\quad 0\le t < \infty, \tag{$\star$}\]

where $W$ is an $r$-dimensional Brownian motion, and $X$, a suitable stochastic process with continuous sample paths and values in $\mathbb{R}^d$, is the “solution” to the equation.

Strong solution

Fix a filtered probability space $(\Omega,\mathcal{F},\{\mathcal{F}_t\}, P)$¹. Recall that a strong solution to $(\star)$ w.r.t. fixed Brownian motion $W$ and initial condition $\xi$ is a process $X$ with continuous sample paths such that:

1. $X$ is $\{\mathcal{F}_t\}$-adapted;
2. $P(X_0 = \xi) = 1$;
3. for every $1 \le i \le d$, $1\le j \le r$, and $0\le t < \infty$, $P\left( \int_0^t \left\{ \lvert b_i(s,X_s)\rvert + \sigma_{ij}^2 (s, X_s)\right\} ds < \infty\right) =1;$
4. the integral version of $(\star)$ holds – that is, $P$-a.s.

\[X_t = X_0 + \int_0^t b(s,X_s) ds + \int_0^t \sigma(s,X_s) dW_s; \quad 0\le t< \infty.\]

The key to this definition is the adaptedness condition 1., which says that $X_t$ depends only on $W_s$ for $s$ up to time $t$. On the other hand, there is an alternative notion of solution which is in some sense less “pathwise” and more “distributional”.
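One way to get a feel for the adaptedness condition is through a time discretization, where the solution is visibly a function of the driving noise up to the current time. Here is a minimal sketch of the Euler-Maruyama scheme in the scalar case. The scheme is a numerical approximation and not part of the definition above; its convergence needs extra regularity of $b$ and $\sigma$ that is not discussed in this post, and the Ornstein-Uhlenbeck coefficients in the demo are an arbitrary choice.

```python
import math
import random


def euler_maruyama(b, sigma, x0, brownian_increments, dt):
    """Approximate dX = b(t, X) dt + sigma(t, X) dW (scalar case) along a given noise path."""
    path, x, t = [x0], x0, 0.0
    for dw in brownian_increments:
        # X at the next grid point only uses the noise increments seen so far.
        x = x + b(t, x) * dt + sigma(t, x) * dw
        t += dt
        path.append(x)
    return path


rng = random.Random(6)
dt, steps = 0.01, 1000
increments = [rng.gauss(0.0, math.sqrt(dt)) for _ in range(steps)]

# Demo: Ornstein-Uhlenbeck dynamics, b(t, x) = -x and sigma = 1.
path = euler_maruyama(lambda t, x: -x, lambda t, x: 1.0, 1.0, increments, dt)
print("X at time", dt * steps, "is approximately", path[-1])
```

Feeding the same increments in twice gives the same path, and changing the increments after some step leaves the path before that step untouched, which is the discrete counterpart of $X_t$ depending only on $W_s$ for $s \le t$.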

Weak solution

A weak solution to $(\star)$ is a pairing of $(X,W)$ and $(\Omega,\mathcal{F},\{\mathcal{F}_t\}, P)$ such that:

1. $(\Omega,\mathcal{F},\{\mathcal{F}_t\}, P)$ is a filtered probability space satisfying the usual conditions;
2. $X$ is a continuous, adapted $\mathbb{R}^d$-valued process and $W$ is an $r$-dimensional Brownian motion;
3. the integrability condition 3. for strong solutions holds;
4. the integral equation in condition 4. for strong solutions holds.

Note that for the case of strong solutions, a probability space was given; on the other hand, for the case of weak solutions, a probability space must be provided as part of the solution!

Simple example

An application of Girsanov’s theorem provides a nice example of a weak solution. Suppose we would like to find a solution to the SDE

\[dX_t = b(t,X_t)dt + dW_t , \quad 0\le t \le T \tag{$\dagger$}\]

where $T < \infty$ is a fixed positive number and $b:[0,T]\times\mathbb{R}^d\rightarrow \mathbb{R}^d$ is a measurable function with at most linear growth

\[\|b(t,x)\| \le K(1+ \|x\|); \quad 0\le t \le T, \, x\in\mathbb{R}^d.\]

Let $(\Omega,\mathcal{F},P)$ be a probability space which supports a Brownian motion $X$, and let $\{\mathcal{F}_t\}$ be the (augmented) Brownian filtration (generated by $X$). Define the process $Z$ as

\[Z_t := \exp\left( \int_0^t b(s,X_s) dX_s - \tfrac{1}{2} \int_0^t \|b(s,X_s)\|^2 ds \right), \quad 0\le t \le T.\]

Due to the Beneš condition (see Karatzas & Shreve, Corollary 3.5.16), $Z$ is a martingale under $P$. Define the measure $Q$ via its Radon-Nikodym derivative $\frac{dQ}{dP} = Z_T$. By the Girsanov theorem, the process $W$ defined by

\[W_t := X_t - X_0 - \int_0^t b(s,X_s)ds, \quad 0\le t \le T\]

is a Brownian motion under $Q$ with $Q(W_0 = 0) = 1$. It is easy to check that $(X,W)$ and $(\Omega,\mathcal{F}, \{\mathcal{F}_t\}, Q)$ constitute a weak solution to $(\dagger)$.

This example demonstrates the peculiarity of adaptedness. For a strong solution, we require $X_t$ to be measurable with respect to $\mathcal{F}_t^W = \sigma(W_s;s\le t)$. On the other hand, the weak solution constructed above gives the “opposite” in some sense; here, $W_t$ is measurable with respect to $\mathcal{F}_t^X = \sigma(X_s;s\le t)$. This distinction is important, and it leads to different interpretations of solutions. Weak solutions provide a very probabilistic interpretation, since to provide a weak solution is essentially to construct a measure. On the other hand, strong solutions allow us to make parallels to deterministic dynamical systems. That is, if an SDE has a strong solution, it can reasonably be interpreted as “an ODE with noise”, and Wong-Zakai type approximations should hold.

Weak but not strong

Given that we have two notions of solution, we should have some examples which explicitly demonstrate that they are in fact distinct. There is also an important philosophical question of what it means to be a weak solution; that is, if the randomness of $X$ cannot be explained entirely through $W$, then where is this “extra randomness” coming from, and how should we interpret it?

Tanaka example

Consider the SDE with drift $b(x) \equiv 0$ and diffusion $\sigma(x) = \text{sgn}(x)$. That is,

\[X_t = \int_0^t \text{sgn}(X_s) dW_s. \tag{$\ddagger$}\]

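Note that any solution $X$ has quadratic variation $\langle X \rangle_t = t$, so by Lévy’s characterization of Brownian motion, $X$ is a Brownian motion. To construct a weak solution, we therefore start from a Brownian motion $X$ (started at zero) on some $(\Omega,\mathcal{F},P)$, and define $W$ as: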

\[W_t = \int_0^t \text{sgn}(X_s) dX_s.\]

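By the same argument, $W$ is a Brownian motion, and since $\text{sgn}(X_s)^2 = 1$ for almost every $s$, we have $\int_0^t \text{sgn}(X_s) dW_s = X_t$. Thus $(X,W)$ and $(\Omega,\mathcal{F}, \{\mathcal{F}_t^X\}, P)$ form a weak solution to $(\ddagger)$. Note that, as in the case of weak solutions formed via the Girsanov theorem, $\mathcal{F}_t^W \subset \mathcal{F}_t^X$.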

In fact, $(\ddagger)$ admits no strong solution at all! Suppose it did. First, recall the Tanaka formula for local time (at zero). For a Brownian motion $B$ starting at zero, where $L_t^B$ is its local time at 0,

\[2L_t^B = |B_t| - \int_0^t \text{sgn}(B_s) dB_s; \quad 0\le t < \infty.\]

Then, since $X$ is a Brownian motion started at zero,

\[\begin{align*} W_t &= \int_0^t \text{sgn}(X_s) dX_s \\ &= |X_t| - 2L_t^X\\ &= |X_t| - \lim_{\epsilon\downarrow 0} \frac{1}{2\epsilon} \text{meas} \{ 0\le s \le t: \lvert X_s \rvert \le \epsilon \}; \quad 0\le t < \infty, P-\text{a.s.} \end{align*}\]

Combined with the adaptedness condition for strong solutions, this implies

\[\mathcal{F}_t^X \subset \mathcal{F}_t^W \subset \mathcal{F}_t^{|X|},\]

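which is a contradiction, since the sign of $X_t$ cannot be recovered from the path of $|X|$, so $\mathcal{F}_t^{|X|}$ is strictly smaller than $\mathcal{F}_t^X$.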

Tsirelson Example

One might think that if the coefficients $b$ and $\sigma$ are sufficiently well-behaved, then one can obtain a weak solution using the Girsanov theorem, and then somehow prove pathwise uniqueness to show that such a solution is in fact strong. In principle, this is true, but let’s consider an example where $\sigma \equiv 1$ (avoiding the issue of the sign function) and $b$ is bounded (which would presumably prevent any sort of explosion), except now we let $b:[0,\infty)\times C([0,\infty);\mathbb{R})\rightarrow \mathbb{R}$ be progressively measurable, meaning the drift depends on the entire past history of $X$ instead of just on a single point $X_t$. That is, we wish to solve the functional SDE:

\[dX_t = b(t,X) dt + dW_t,\quad 0\le t < \infty. \tag{$\S$}\]

The following discussion is taken from Revuz, Yor p. 392. Let $(t_k)_{k\in -\mathbb{N}}$ be a strictly increasing sequence such that $0 < t_k < 1$ for $k <0$, $t_0 = 1$, and $\lim_{k\rightarrow -\infty} t_k = 0$. Then, set

\[\begin{align*} \tau (t,z) &= \left[ \frac{ z(t_k) - z(t_{k-1}) }{ t_k - t_{k-1} } \right] \quad \text{ if } t_k < t\le t_{k+1} \\ &= 0 \quad \text{ if } t= 0 \text{ or } t > 1, \end{align*}\]

where $[x]$ indicates the fractional part of a real number $x$. Let $(X,W)$ denote a solution to $(\S)$ with $b= \tau$. For $t_k < t \le t_{k+1}$, let $\eta_t = \frac{X_t - X_{t_k}}{t - t_k}$ and let $\epsilon_t = \frac{W_t - W_{t_k}}{t- t_k}$. Then,

\[\eta_t = \epsilon_t + [\eta_{t_k}].\]

Note that $\mathcal{F}_t^X = \sigma([\eta_t]) \vee \mathcal{F}_t^W$. The lack of a strong solution follows from the following claim, which can be proved through some elementary calculations involving conditional expectation and characteristic functions:

Lemma. For $t\in (0,1]$, the random variable $[\eta_t]$ is uniformly distributed on $[0,1]$, and independent of $\mathcal{F}_1^W$.

Discrete time analog

In this section, we analyze a discrete-time version of the Tanaka example, for which a “strong solution” does exist! We proceed as in Warren 1999. Let $X=(X_n)_{n\in\mathbb{N}_0}$ be the symmetric nearest neighbor random walk on $\mathbb{Z}$, and define $W= (W_n)_{n\in\mathbb{N}_0}$ by setting $W_0 = 0$ and

\[W_{n+1} - W_n = \text{sgn}(X_n)(X_{n+1} - X_n).\]

Note that $W$ is also a symmetric nearest neighbor random walk on $\mathbb{Z}$, and we can write

\[X_n = \sum_{k=0}^{n-1} \text{sgn}(X_k) (W_{k+1} - W_k), \tag{$\parallel$}\]

the discrete version of the SDE $(\ddagger)$. Moreover, with the convention $\text{sgn}(0) = 1$, we can obtain a discrete version of the Tanaka formula. Let $L_0=0$ and define $L_n = \sum_{k=0}^{n-1} \mathbb{1}_{\{X_k,X_{k+1} \in \{0,-1\}\}}$. Then,

\[\begin{align*} |X_n + \tfrac{1}{2}| - \tfrac{1}{2} &= \sum_{k=0}^{n-1} \text{sgn}(X_k) (X_{k+1} - X_k) + L_n\\ &= W_n + \sup_{k\le n} (-W_k). \end{align*}\]

In light of this formula, we can show that in the discrete setting, $X$ is fully determined by $W$. To do so, it remains to show that $\text{sgn}(X_n + \tfrac{1}{2})$ can be determined from $\{W_k\}_{k\le n}$. For $n\in\mathbb{N}$, define

\[m_n := \sup \left\{ m \in \{0,1,\cdots, n\} : W_m = -\sup_{k \le m} (-W_k) \right\} .\]

Note that $X_{m_n} \in \{0, -1\}$, and for all $ m_n < \ell \le n$, we know $W_\ell > - \sup_{k\le \ell} (-W_k)$, meaning $\lvert X_\ell + \tfrac{1}{2}\rvert > \tfrac{1}{2}$. This implies that

\[\begin{align*} X_{m_n} + \tfrac{1}{2} = +\tfrac{1}{2} &\quad \Rightarrow \quad X_\ell + \tfrac{1}{2} > 0, \\ X_{m_n} + \tfrac{1}{2} = -\tfrac{1}{2} &\quad \Rightarrow \quad X_\ell + \tfrac{1}{2} < 0. \end{align*}\]

Thus, $X_n$ is measurable with respect to $\mathcal{F}_n^W = \sigma(W_k : k \le n)$. That is, this adaptedness property gives us a “strong solution” to $(\parallel)$, even though no such strong solution exists for the SDE $(\ddagger)$!
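The reconstruction of $X$ from $W$ can be checked directly on simulated walks. The sketch below assumes the convention $\text{sgn}(0) = +1$ (the sign of $x + \tfrac{1}{2}$), under which the discrete Tanaka formula above holds. The parity bookkeeping used to pick the side of the excursion is one way of making the sign step explicit; it is my own shortcut for this sketch and is not taken from Warren's paper.

```python
import random


def sgn(x):
    """Sign convention with sgn(0) = +1, i.e. the sign of x + 1/2 (assumed here)."""
    return 1 if x >= 0 else -1


def noise_from_walk(walk):
    """W_{n+1} - W_n = sgn(X_n) (X_{n+1} - X_n), with W_0 = 0."""
    noise = [0]
    for prev, nxt in zip(walk, walk[1:]):
        noise.append(noise[-1] + sgn(prev) * (nxt - prev))
    return noise


def walk_from_noise(noise):
    """Recover X from W alone, using |X_n + 1/2| - 1/2 = W_n + sup_{k<=n}(-W_k)."""
    walk, running_sup = [], 0
    for w in noise:
        running_sup = max(running_sup, -w)   # equals L_n
        height = w + running_sup             # equals |X_n + 1/2| - 1/2
        # X sits at 0 or -1 whenever height is 0, and it switches between the
        # two each time L_n increases, so the parity of L_n picks the side.
        walk.append(height if running_sup % 2 == 0 else -height - 1)
    return walk


def simple_random_walk(steps, rng):
    """Symmetric nearest neighbor random walk on Z started at 0."""
    walk = [0]
    for _ in range(steps):
        walk.append(walk[-1] + rng.choice((-1, 1)))
    return walk


rng = random.Random(7)
X = simple_random_walk(50, rng)
W = noise_from_walk(X)
print("recovered X from W:", walk_from_noise(W) == X)
```

Since `walk_from_noise` reads `noise` one entry at a time, the value it assigns to $X_n$ only depends on $W_0,\dots,W_n$, which is exactly the adaptedness claimed above.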

One interpretation of this phenomenon is that there is some connection between the “loss of information” about the sign, and the fact that $x\mapsto \text{sgn}(x)$ is “noise sensitive”. As described by Warren, suppose we have a pair of random walks $W$ and $W'$ such that the step sizes have correlation $\rho \in (0,1)$. From this pair of noises, define $X$ and $X'$ as in $(\parallel)$. In the asymptotic limit as $n$ grows large, it is possible to show that $\text{sgn}(X_n)$ and $\text{sgn}(X_n')$ are uncorrelated, regardless of the correlation $\rho \in (0,1)$, indicating that the sign function is asymptotically sensitive to any non-zero perturbation of the noise $W$.

The fundamental question is: What is happening in the scaling limit?! Questions of this nature are discussed in brief in the Warren paper, and can also be found in some works of Tsirelson.

¹ In order to avoid dealing with completions or augmentations of any kind, we will always assume the usual conditions on any filtration discussed. That is, the filtration $\{\mathcal{F}_t\}$ is right-continuous and $\mathcal{F}_0$ contains all $P$-null sets in $\mathcal{F}$.

Weak solutions of SDE was originally published by Steven Soojin Kim at Steven Soojin Kim on 2014.09.17.

Sparse PCA

2014-09-16T00:00:00-04:00

I recently sat in on the applied probability topics seminar at MIT, which will spend the semester covering some very modern topics in statistics (in particular: sparse PCA, matrix completion, and community detection). This past Friday, two students gave a very nice overview of the statistical and computational aspects of the semidefinite relaxation developed in d’Aspremont, El Ghaoui, Jordan, Lanckriet 2007. As a brief addendum to my previous post on relaxations, and as a reminder to myself, I’d like to review sparse principal component analysis (PCA), or at least the basic problem formulation.

Recall the setup of PCA. For $n$ observations with $p$ features, denote the data matrix by $X \in \mathbb{R}^{n\times p}$, and the sample covariance by

\[S = \frac{1}{n} X^T X - \frac{1}{n^2} X^T \mathbb{1}_n \mathbb{1}_n^T X \in \mathbb{R}^{p\times p},\]

where $\mathbb{1}_n$ is the $n\times 1$ vector of ones. The notion behind (classical) PCA is to find an orthonormal subset of vectors which will “explain” much of the data. To be precise, for $j=1,\cdots, p$, the $j$-th principal component is defined as follows:

\[v_j = \arg \max_{\substack{ v \in \mathbb{R}^p \\ \text{s.t. } \| v\|_2 = 1, \\ v \perp v_1, \cdots, v_{j-1} } } v^T S v.\]

For some intuition on the objective function, note that $v^T S v$ is the empirical variance of $X v$, the $n$ samples projected onto the vector $v$. By selecting vectors to maximize empirical variance, we are finding the most influential aspects of the data. From one perspective, this optimization problem is already somewhat hard, since it is doubly non-convex: we are asked to maximize a convex function $v\mapsto v^T S v$, and we are optimizing over the sphere $\|v \|_2 = 1$, a non-convex set. On the other hand, if we combine the unit norm constraint with the objective function, this problem becomes one of maximizing the Rayleigh quotient. That is, the $j$-th principal component is the eigenvector associated with $\lambda_j$, the $j$-th largest eigenvalue of $S$, so PCA becomes a problem of eigenvalue decomposition ¹. Note that the “total variance” of the data can be written as $\text{tr} (S) = \lambda_1 + \cdots + \lambda_p$.
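As a small illustration of the classical setup, the sketch below builds $S$ from a toy data matrix and finds the first principal component by power iteration, using only the standard library. The toy data, the function names, and the choice of power iteration (instead of a full eigenvalue decomposition) are assumptions of this sketch; power iteration also presumes a gap between the top two eigenvalues and a starting vector that is not orthogonal to $v_1$.

```python
import math
import random


def sample_covariance(X):
    """S = X^T X / n - (X^T 1)(1^T X) / n^2 for a data matrix given as a list of rows."""
    n, p = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(p)]
    return [[sum(row[i] * row[j] for row in X) / n - mean[i] * mean[j] for j in range(p)]
            for i in range(p)]


def leading_component(S, iterations=500):
    """Power iteration for the top eigenvector of a symmetric PSD matrix S."""
    p = len(S)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(S[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def quadratic_form(S, v):
    """The objective v^T S v."""
    return sum(v[i] * S[i][j] * v[j] for i in range(len(v)) for j in range(len(v)))


rng = random.Random(8)
data = []
for _ in range(500):
    z = rng.gauss(0.0, 1.0)
    # Most of the variance lies along the direction (1, 2, 0).
    data.append([z + 1.0, 2.0 * z + 0.1 * rng.gauss(0.0, 1.0), 0.1 * rng.gauss(0.0, 1.0)])

S = sample_covariance(data)
v1 = leading_component(S)
print("first principal component:", v1)
print("explained variance v^T S v:", quadratic_form(S, v1), "of total", sum(S[i][i] for i in range(3)))
```

Note that `v1` has no exactly-zero coordinates even though the third feature is pure noise, which is the interpretability issue that motivates the sparse version of the problem.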

We would like to use PCA not only for dimension reduction (i.e., by projecting onto $p' \ll p$ dimensions which capture much of the variance of the data), but also for interpretation of the $p$ factors. The trouble is that in general, each principal component $v_i$ has non-zero elements at each coordinate. What if we would like to have only $k \ll p$ non-zeros? Then, for the first principal component, we would like to solve:

\[\begin{align*} \tag{$\star$} \text{ max } \, & v^T S v \\ \text{ s.t. } & \|v\|_2 = 1\\ & \text{card}(v) \le k. \end{align*}\]

The cardinality constraint makes this problem even more difficult due to the additional combinatorial aspect involved. Note that we can rewrite $v^T S v = \text{tr}(Svv^T)$, which inspires the following rewrite of $(\star)$, which is more conducive to a semidefinite relaxation:

\[\begin{align*} \tag{$\dagger$} \text{ max } \,& \text{tr}(SV) \\ \text{ s.t. } & V \succeq 0\\ & \text{tr}(V) = 1\\ & \text{card}(V) \le k^2\\ & \text{rank}(V) = 1. \end{align*}\]

Note that this formulation is already quite nice: the objective is linear in $V$ instead of quadratic in $v$, and the constraint $\|v\|_2=1$ has been changed into convex constraints on $V$ (positive semi-definiteness and trace = 1). However, the cardinality and rank constraints are still combinatorial in nature. To this end, since $\text{tr}(V) = 1$ and $\text{rank}(V) = 1$ give $\|V\|_F = 1$, note that the cardinality constraint $\text{card}(V) \le k^2$ together with the Cauchy-Schwarz inequality implies

\[\mathbb{1}_p^T |V| \mathbb{1}_p = \|\text{vec}(V)\|_1 \le k \|\text{vec}(V)\|_2 = k\|V\|_F = k.\]
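The inequality is easy to test numerically for rank-one matrices $V = vv^T$ built from sparse unit vectors. This is a minimal sketch with made-up helper names and arbitrary dimensions; the second printout shows that the bound is attained when the $k$ non-zero entries all have equal magnitude.

```python
import math
import random


def l1_mass(v):
    """1^T |V| 1 for V = v v^T, which equals (sum_i |v_i|)^2."""
    p = len(v)
    return sum(abs(v[i] * v[j]) for i in range(p) for j in range(p))


def random_sparse_unit_vector(p, k, rng):
    """A unit vector in R^p supported on k randomly chosen coordinates."""
    support = rng.sample(range(p), k)
    v = [0.0] * p
    for i in support:
        v[i] = rng.gauss(0.0, 1.0)
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


rng = random.Random(9)
p, k = 20, 4  # illustrative sizes

worst = max(l1_mass(random_sparse_unit_vector(p, k, rng)) for _ in range(1000))
print("largest 1^T|V|1 over random sparse v:", worst, "<= k =", k)

flat = [1.0 / math.sqrt(k)] * k + [0.0] * (p - k)
print("equal-magnitude entries attain the bound:", l1_mass(flat))
```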

As for the rank constraint, simply drop it to obtain the relaxation:

\[\begin{align*} \tag{$\ddagger$} \text{ max } & \text{tr}(SV) \\ \text{ s.t. } & V \succeq 0\\ & \text{tr}(V) = 1\\ & \mathbb{1}_p^T |V| \mathbb{1}_p \le k. \end{align*}\]

Note that $(\ddagger)$ is an SDP, with variable $V \in \mathbb{S}^p$. Since every feasible point of $(\dagger)$ is also feasible for $(\ddagger)$, the optimal value of this relaxation is an upper bound on the optimal value of the original problem $(\star)$.

Such is the problem setup. There are several questions one can ask regarding this problem, none of which I will go into in detail:

Does solving this relaxed problem produce feasible solutions (i.e., sparse vectors)? Numerical experiments suggest that this is frequently the case.
What if instead of the constrained problem, we analyze the penalized problem, with objective $v^T S v - \rho \,\text{card}(V)$? It turns out that the dual of the relaxed penalized problem can be interpreted as a “worst-case” computation of the maximum eigenvalue of a perturbed version of $S$.
How can we actually solve the associated SDP? One might immediately turn to interior point methods, but the $O(p^2)$ constraints make Newton’s method too costly. On the other hand, first-order methods have cheap iterations, low memory requirements, and lend themselves to parallelization. The cost is that they converge slowly (typically something like $O(1/\epsilon)$ for $\epsilon$ precision); but this cost is somewhat artificial since the statistical nature of this problem means that we only care about achieving computational error up to some statistical threshold.

Note that since $S$ is a symmetric and positive semi-definite real matrix, the eigenvalues are all real and non-negative. ↩

Sparse PCA was originally published by Steven Soojin Kim at Steven Soojin Kim on 2014.09.16.

Relaxations in Optimization

2014-08-31T00:00:00-04:00

$\mathbb{R}$ vs. $\mathbb{Z}$: LP Relaxations of IP
The Great Watershed: Convex Relaxation
1. Semidefinite programming
2. Goemans, Williamson (MAX CUT)
Sparsity, Statistics, and Selection
1. LASSO
2. Matrix Completion
Related Notions

I’m excited to announce that I am tentatively assigned to be a TA for APMA 1210 in Fall 2014. The course will introduce the elements of operations research, with a focus on deterministic optimization methods. I am glad that Brown offers an undergraduate course on this subject, since it’s been my experience that many mathematicians overlook the structural beauty, historical importance, and practical relevance of optimization theory.

Since the semester is fast approaching, I decided to review a little bit for myself. I find myself particularly amazed by the recurring theme of relaxation (or rather, how well relaxations manage to work). For those new to the concept, recall that mathematicians are known for simplifying: they like to turn mathematically “hard” problems into mathematically “simple” ones. Analogously, computer scientists like to turn computationally hard problems into computationally tractable ones. The trick is, in both cases, upon solving the tractable problem, one must check whether it tells you anything useful about the original hard problem.

To be a little bit more precise, let $X$ be some set and $f: X\rightarrow\mathbb{R}$ some function. Then, we wish to compute

\[\min_{x\in X} f(x),\]

and also find the minimizing value(s) $x^*$. But suppose that finding this optimal solution is quite hard; the broad idea of relaxation is to settle for a suboptimal solution that can be found more easily. That is, consider an alternative set $\tilde{X}$ that is “similar” to $X$, and an alternative function $\tilde{f}:\tilde{X}\rightarrow\mathbb{R}$ that is “similar” to $f$, and then compute

\[\min_{x\in \tilde{X}} \tilde{f}(x)\]

and the minimizing value $\tilde{x}^*$. A miracle happens when finding the solution to the relaxed problem $\tilde{x}^* \in \tilde{X}$ can produce an approximate solution $x' \in X$ that works reasonably well for the original problem.

As a side note, I think the timing is perfect for an undergraduate to learn about optimization, relaxation, and approximation. On the practical side, optimization is an increasingly important part of our computational and data-oriented world. On the theoretical side, the recent ICM 2014 provided a showcase of some recent aspects of optimization and complexity (as it has in previous years). In particular, the Nevanlinna Prize was recently awarded to Subhash Khot for his work on the Unique Games Conjecture, which offers a lens through which one can analyze the critical frontier of approximate computability. Also at the ICM, there was a plenary lecture given by Emmanuel Candes, whose work on the computational end of compressed sensing sparked many modern developments in approximation and relaxation theory. Moreover, ICERM will be hosting a workshop on Approximation, Integration, and Optimization in Fall 2014. While APMA 1210 will only offer a peek at optimization, it should build up some of the classical foundations underlying the very modern mathematics described above.

In the remainder of the blog post, I will survey a few examples of relaxation, with a particular focus on turning combinatorial optimization problems into linear/convex optimization problems.

$\mathbb{R}$ vs. $\mathbb{Z}$: LP Relaxations of IP

Recall the typical form of a linear program (LP). Given $m,n\in\mathbb{N}$, $A \in \mathbb{R}^{m\times n}$, $b\in \mathbb{R}^m$, $c\in \mathbb{R}^n$, we wish to find $x^*\in\mathbb{R}^n$ which solves:

\[\begin{align*} \text{ min } & c^T x\\ \text{ s.t. } & Ax \le b \\ & x_i \ge 0, \quad i=1,\cdots, n. \end{align*}\]

This problem has an incredibly rich mathematical history, from the war-changing work of Kantorovich, to the legal issues on patentability that arose as a result of Karmarkar’s algorithm. The linearity assumption imposes strong structural constraints on this optimization problem. In particular, a linear function on a convex polytope has the nice property that it will achieve its optima at the polytope’s corners. There are several ways to solve such an LP: e.g., Dantzig’s simplex method (which works well empirically) and Khachiyan’s ellipsoid method (which has a “better” theoretical guarantee). For the practitioner who wishes to use LP as a black box, or for the mathematician who wishes to reduce harder problems to LP, the most important fact is that an LP can be solved in polynomial time with respect to the number of variables $n$.

Of course, we should ask what is meant by “polynomial time”, since we are dealing with real-valued solutions¹. There are (at least) two notions of complexity when dealing with optimization.

Rational Arithmetic Model – The computational cost is measured in terms of the number of arithmetic operations and comparisons on rational numbers (which can be represented as finite-length binary words). This is reasonably similar to a physical model of computation. Under this model, Khachiyan’s ellipsoid method showed that an LP can be solved in time polynomial with respect to $n$, the number of variables, and $L$, the size of the problem in terms of bits required to represent it.
Information Complexity – Here, complexity is measured in terms of number of calls to an “oracle” which takes input $x$ and outputs the objective $f(x)$ and its gradient $\nabla f(x)$. There is also dependence on $\epsilon$, the level of precision desired for a solution. Roughly speaking, this model measures not the number of computations, but the number of “iterations” an algorithm takes. This measure of complexity is particularly natural in statistics for two reasons: first, optimization problems in statistics typically have an unknown objective function for which there is limited data; second, there is little point in finding an exact minimum, since in addition to computational error of imprecise minimization, there will always be a level of statistical error as quantified by whatever generalization bound exists for the problem.

For a more expansive discussion on posing the question of complexity in the context of optimization, see Nemirovski’s notes on linear optimization §6.1 and convex optimization §1.2.

Without going into further detail about complexity, one should be happy if a problem can be reduced to an LP. In practice, many problems are “almost” an LP, but not quite. For a concrete (but not very serious) example, suppose you go to a restaurant and set a $50 budget for yourself. There are various items you can order, each of which gives you a certain amount of happiness per item (e.g., $h_{\text{pâté}}, h_{\text{ceviche}}, h_{\text{shortrib}}$), but comes at a certain price level (e.g., $p_{\text{pâté}}, p_{\text{ceviche}}, p_{\text{shortrib}}$). You want to maximize your happiness $\sum_{i \in \text{menu}} h_i x_i$, given that you must stay within your budget $\sum_{i \in \text{menu}} p_i x_i \le 50$. Unfortunately, the restaurant doesn’t let you order a non-integer amount of short rib dishes, so you must select $x_i \in \mathbb{Z}$. The same applies to other practical problems like vehicle scheduling, employee assignment, and resource allocation.

The situation described above is an example of a combinatorial problem known as integer programming (IP). The general setup is similar to LP, but with one additional constraint:

\[\begin{align*} \text{ min } & c^T x\\ \text{ s.t. } & Ax \le b \\ & x_i \ge 0\\ & x_i \in \mathbb{Z}, \quad i=1,\cdots, n. \end{align*}\]

As one might guess from the combinatorial nature of integer programming, it is in general NP-hard to find a solution². It turns out that several algorithmic problems can be formulated as integer programs: TSP, cover problems, and satisfiability problems to name a few. This shouldn’t come as a surprise, since all of these problems are vaguely questions of assigning elements under certain constraints.

Given this computational difficulty, and the fact that LP is so “easy” in comparison, it is natural to try to simply discard the $x_i\in\mathbb{Z}$ condition and reduce an IP to an LP. This is known as linear programming relaxation, since we “relax” the integer constraint. The optimal value of the LP provides a lower bound on the optimal value of the IP (since it is minimizing over a larger set, with fewer constraints). However, solving the LP does not immediately provide a candidate solution to the IP, since the resulting optimal $x$ will, in general, not be integer-valued. Here we should borrow the words of John Tukey:

Far better an approximate answer to the right question … than an exact answer to the wrong question.

To this end, we can consider randomized rounding of a solution to the LP to obtain an integer-valued candidate. The general idea is to round each coordinate of the LP solution $\tilde{x}^*$ according to some rule, in order to obtain a vector of integers $x'$. Then, with some probability (or, always, via derandomization), $x'$ is a candidate for the original integer program which is “almost as good” as the optimal solution.
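To make the bound concrete, here is a rough sketch of the restaurant example in plain Python. The happiness values and prices are numbers I made up. Since this is a maximization problem, the LP value is an upper bound on the IP value (the mirror image of the lower bound for minimization). With a single budget constraint the relaxed problem can be solved by hand, so no LP solver is needed, and the rounding rule here is the crudest deterministic one: round down.

```python
import math

def lp_relaxation(happiness, price, budget):
    """Relaxed problem with real x_i >= 0. With a single budget constraint,
    the optimum spends everything on the item with the best happiness per dollar."""
    best = max(happiness, key=lambda item: happiness[item] / price[item])
    x = {item: 0.0 for item in happiness}
    x[best] = budget / price[best]
    return happiness[best] * x[best], x

def integer_program(happiness, price, budget):
    """Exact IP optimum by dynamic programming over integer budgets."""
    value = [0] * (budget + 1)
    for b in range(1, budget + 1):
        for item in happiness:
            if price[item] <= b:
                value[b] = max(value[b], value[b - price[item]] + happiness[item])
    return value[budget]

# Hypothetical menu: happiness per dish and integer prices in dollars.
happiness = {"pate": 6, "ceviche": 8, "shortrib": 15}
price = {"pate": 12, "ceviche": 15, "shortrib": 26}
budget = 50

lp_value, x_lp = lp_relaxation(happiness, price, budget)
ip_value = integer_program(happiness, price, budget)

# Rounding the LP solution down always stays within budget.
x_rounded = {item: math.floor(x_lp[item]) for item in x_lp}
rounded_value = sum(happiness[item] * x_rounded[item] for item in x_rounded)

print(rounded_value, "<=", ip_value, "<=", lp_value)
```

The rounded order is feasible but leaves money on the table, which is why smarter (randomized) rounding rules are worth the trouble.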

Consider the set cover problem: fix a set of elements $U=\{1,\cdots, M\}$, a collection $S$ of $n$ sets whose union equals $U$, and costs $c_1,\cdots,c_n$ attached to each set in $S$; find a subset of $S$ whose union equals $U$ while minimizing total cost. This problem is NP-hard. However, rounding the solution from an LP relaxation can efficiently provide a solution which is not bad:

Theorem. A (derandomized) rounding scheme for the set cover problem can return a candidate of cost $O(\log M)$ times the cost of the optimal set cover.
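Here is a minimal sketch of the randomized (not derandomized) version of such a rounding scheme, on a toy instance of my own choosing: three elements and the three two-element subsets, each with unit cost. For this instance the LP optimum can be found by hand (each $x_j = 1/2$, with value $3/2$), so I hard-code it instead of calling a solver. The number of rounds, $\lceil 2\log M\rceil$, is an illustrative choice.

```python
import math
import random

def randomized_rounding(universe, sets, x_frac, rng):
    """Keep set j independently with probability x_frac[j], for about log(M)
    rounds, and keep going until every element of the universe is covered."""
    rounds = max(1, math.ceil(2 * math.log(len(universe))))
    chosen = set()
    while True:
        for _ in range(rounds):
            for j, prob in enumerate(x_frac):
                if rng.random() < prob:
                    chosen.add(j)
        covered = set().union(*(sets[j] for j in chosen))
        if covered == universe:
            return chosen

# Toy instance (an assumption for illustration, not taken from the theorem).
universe = {1, 2, 3}
sets = [{1, 2}, {2, 3}, {1, 3}]
costs = [1, 1, 1]
x_frac = [0.5, 0.5, 0.5]  # LP optimum for this instance, found by hand

rng = random.Random(0)
cover = randomized_rounding(universe, sets, x_frac, rng)
cover_cost = sum(costs[j] for j in cover)
lp_value = sum(c * x for c, x in zip(costs, x_frac))
print("LP value:", lp_value, " rounded cover:", sorted(cover), " cost:", cover_cost)
```

Any integer cover of this instance needs at least two sets, so the LP value $3/2$ is a strict lower bound here, and the rounded cover costs at most a small multiple of it.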

One might think the set cover problem is just a lucky special case where the math happens to work out, but hopefully with the additional examples below, it becomes clear that relaxation is a broadly applicable approach towards obtaining approximate solutions in a much shorter time.

The Great Watershed: Convex Relaxation

Based on the above discussion on linear programs, one might naturally wonder about nonlinear programs. In dynamical systems and PDE, “linearity” proves to be a very useful property, and “nonlinearity” is more difficult to analyze. This is somewhat true for optimization as well, and in fact, one can find a vast literature that references “nonlinear programming”. However, I tend to agree with the following quote from R.T. Rockafellar:

In fact the great watershed in optimization isn’t between linearity and nonlinearity, but convexity and nonconvexity.

Recall the typical form of a convex program (CP). Given a convex set $D\subset \mathbb{R}^n$ (or, more generally, a real vector space), and a convex function $f: D\rightarrow \mathbb{R}$, we wish to find $x^*\in \mathbb{R}^n$ which solves:

\[\begin{align*} \text{ min } & f(x)\\ \text{ s.t. } & x\in D \end{align*}\]

Of course, convex optimization is certainly harder than linear optimization: abstractly, linearity is a trivial type of convexity; practically, convex functions can achieve their minima on the interior as well as on the boundary. Nonetheless, convexity is an incredibly strong assumption when searching for optima, since for a convex function on a convex domain, any local optimum is a global optimum. Because of this, a great many algorithms for solving convex programs rely on the fundamental principle of gradient descent; that is, since “local” searches are provably asymptotically correct, an algorithm can keep taking small steps in the direction of a region of lower potential. Of course, there is a great deal of elegance in choosing the precise step size and direction in order to achieve optimal error, but I won’t go into detail here. Just as in the case of LP, one should be happy if a problem can be reduced to a CP.
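As a bare-bones illustration of the gradient descent principle, here is a sketch with a fixed step size on a convex quadratic. The objective, the starting point, and the step size are all my own choices; nothing clever is done about step selection.

```python
def gradient_descent(grad, x0, step, iterations):
    """Repeatedly take a small step in the direction of steepest descent."""
    x = list(x0)
    for _ in range(iterations):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# Hypothetical convex objective f(x, y) = (x - 1)^2 + 2 (y + 3)^2,
# whose unique minimizer is (1, -3).
def grad_f(x):
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)]

x_min = gradient_descent(grad_f, x0=[10.0, 10.0], step=0.1, iterations=200)
print(x_min)
```

Because the objective is convex, the local search cannot get stuck: the iterates contract towards the global minimizer from any starting point, provided the step is small enough.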

Semidefinite programming

A particularly interesting case of convex optimization is semidefinite programming (SDP). Essentially, the task is to optimize over matrices instead of typical vectors. Let $\mathbb{S}^n$ denote the set of $n\times n$ real symmetric matrices. For $A,B \in \mathbb{S}^n$, define the inner product $\langle A, B\rangle = \text{tr}(A^TB)$. Fix $m$. For $C, A_1,\cdots, A_m \in \mathbb{S}^n$, and $b_1,\cdots, b_m \in \mathbb{R}$:

\[\begin{align*} \text{ min } & \langle C, X\rangle \\ \text{ s.t. } & X \in \mathbb{S}^n \\ & \langle A_k, X\rangle \le b_k, \quad k=1,\cdots,m \\ & X \succeq 0. \end{align*}\]

I know very little about the details of algorithms which can solve SDPs, but the guarantee as I know it is this: to obtain a solution up to additive error $\epsilon$, an algorithm can output a solution in time polynomial in $n$ and $\log(1/\epsilon)$.

Goemans, Williamson (MAX CUT)

One of the most famous and well-cited examples of convex relaxation to an SDP is the application to the MAX CUT problem by Goemans and Williamson. Given a graph $G= ([n],E)$ and weights $W_{ij}= W_{ji}$ for $(i,j) \in E$, the maximum cut problem is to find a subset $S\subset [n]$ such that the weight of the edges in the cut $(S, S^c)$ (that is, the sum of the weights of the edges between $S$ and $S^c$) is maximized. This problem is known to be NP-hard (its decision version is NP-complete), so we should consider approximate solutions, again through relaxation and randomization. To formulate our problem precisely, we want to solve:

\[\begin{align*} \tag{$\star$} \text{ max } & \frac{1}{2} \sum_{i,j=1}^n W_{ij} (x_i-x_j)^2\\ \text{ s.t. } & x_i \in \{-1, +1\}, \quad i=1 ,\cdots ,n \end{align*}\]

Let $W = (W_{ij})_{i,j}$ be the weighted adjacency matrix of $G$. Let $D$ be the diagonal matrix with $i$-th entry $\sum_{j=1}^n W_{ij}$. Then, we define $L=D-W$ to be the graph Laplacian. Then, note that $\frac{1}{2} \sum_{i,j=1}^n W_{ij} (x_i-x_j)^2 = x^T L x$. Using the inner product on matrix space, we can further rewrite $x^TL x= \langle L, xx^T\rangle$. Thus, we can rewrite the above problem as

\[\begin{align*} \tag{$\dagger$} \text{ max } & \langle L, xx^T\rangle \\ \text{ s.t. } & x\in \{-1,+1\}^n \end{align*}\]
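The identity between the sum of squared differences, the Laplacian quadratic form, and the cut weight is easy to check numerically. The sketch below uses a small weighted graph and a sign vector that I picked arbitrarily; note that with $x_i \in \{-1,+1\}$ each cut edge contributes $(x_i - x_j)^2 = 4$, so the objective equals four times the cut weight.

```python
def laplacian(W):
    """Graph Laplacian L = D - W, with D the diagonal matrix of row sums of W."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def quadratic_form(L, x):
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))

def half_sum_of_squares(W, x):
    n = len(x)
    return 0.5 * sum(W[i][j] * (x[i] - x[j]) ** 2 for i in range(n) for j in range(n))

def cut_weight(W, x):
    """Total weight of edges whose endpoints receive different signs."""
    n = len(x)
    return sum(W[i][j] for i in range(n) for j in range(i + 1, n) if x[i] != x[j])

# Hypothetical symmetric weight matrix on 4 vertices and a candidate cut.
W = [[0.0, 1.0, 2.0, 0.0],
     [1.0, 0.0, 3.0, 1.0],
     [2.0, 3.0, 0.0, 4.0],
     [0.0, 1.0, 4.0, 0.0]]
x = [1, -1, 1, -1]

L = laplacian(W)
print(half_sum_of_squares(W, x), quadratic_form(L, x), 4 * cut_weight(W, x))
```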

Then, a natural convex relaxation is to move from the combinatorial problem of optimizing over matrices $xx^T$ such that $x\in \{-1,+1\}^n$, to a convex problem by searching over a larger space of matrices.

\[\begin{align*} \tag{$\ddagger$} \text{ max } & \langle L, X\rangle \\ \text{ s.t. } & X \in \mathbb{S}^n\\ & X \succeq 0\\ & X_{ii} =1, \quad i=1,\cdots,n \end{align*}\]

In essence, the Goemans-Williamson algorithm is: solve the relaxed problem $(\ddagger)$, factor the solution as $X = B^TB$, generate a uniformly random hyperplane through the origin, and separate the vertices $[n]$ by seeing on which side of the hyperplane the column vectors of $B$ fall. The interesting thing is how much such an approach helps! To be precise (and taking a slightly different, but equivalent, geometric view): let $X$ be the solution to $(\ddagger)$, let $\xi \sim N(0,X)$, and let $\zeta = \text{sign}(\xi) \in \{-1,+1\}^n$. Note that $\zeta$ is a random candidate for a cut of $[n]$.

Theorem. $\mathbb{E} \langle L, \zeta \zeta^T\rangle \ge 0.878 \cdot \text{ optimal solution value of MAX CUT }$
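To see the rounding step in action without an SDP solver, here is a sketch on the smallest interesting instance: the unit-weight triangle. By symmetry one can check by hand that an optimal $X$ for the relaxation has off-diagonal entries $-1/2$, which is the Gram matrix of three unit vectors at $120^\circ$ in the plane; I hard-code those vectors, so treat that as an assumption of the example. Drawing a standard Gaussian $g$ and setting $\xi_i = \langle g, v_i\rangle$ produces a Gaussian vector with covariance $X$.

```python
import math
import random
from itertools import product

# Laplacian of the triangle with unit weights.
L = [[2.0, -1.0, -1.0], [-1.0, 2.0, -1.0], [-1.0, -1.0, 2.0]]

# Hard-coded relaxed solution: unit vectors at 120 degrees, X_ij = <v_i, v_j>.
vectors = [(math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
           for k in range(3)]
X = [[u[0] * w[0] + u[1] * w[1] for w in vectors] for u in vectors]

def inner(A, B):
    """Matrix inner product <A, B> = tr(A^T B)."""
    n = len(A)
    return sum(A[i][j] * B[i][j] for i in range(n) for j in range(n))

def objective(zeta):
    """<L, zeta zeta^T> for a sign vector zeta."""
    return inner(L, [[a * b for b in zeta] for a in zeta])

def round_once(rng):
    """Random hyperplane rounding: the sign of <g, v_i> for a Gaussian direction g."""
    g = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    return [1 if g[0] * v[0] + g[1] * v[1] >= 0 else -1 for v in vectors]

rng = random.Random(0)
sdp_value = inner(L, X)
best_cut_value = max(objective(z) for z in product((-1, 1), repeat=3))
average = sum(objective(round_once(rng)) for _ in range(1000)) / 1000
print("relaxed:", sdp_value, " rounded average:", average, " optimum:", best_cut_value)
```

On this tiny instance every hyperplane splits the three vectors one against two, so the rounding always finds a maximum cut; the gap between the relaxed value and the true optimum is where the approximation ratio comes from in general.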

For reasons I don’t quite understand, I have read that this approximation algorithm (and the 0.878 ratio) is in some sense optimal, assuming the validity of the Unique Games Conjecture!

Sparsity, Statistics, and Selection

The fundamental concept behind the combinatorial problems described above is the selection of an optimal subset. It turns out that problems of selection are also common in statistics and computational learning theory. Suppose we have data which is high-dimensional in some sense (in that there are far more parameters than data points); solving such a problem is in general quite hard, but it becomes much more tractable if we can assume there is some sort of sparse low-dimensional structure. The key is to properly select the low dimensions.

LASSO

Consider the classical problem of regression. Fix $n$ samples and $p$ dimension such that $p \gg n$. Given data $(X, y)$ and coefficients $\theta$, where

\[X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} \in \mathbb{R}^{n\times p},\]

$y \in \mathbb{R}^n$, and $\theta \in \mathbb{R}^p$, we assume:

\[y = X \theta + \epsilon,\]

where $\epsilon \in \mathbb{R}^n$ is some noise³. Our goal is to estimate $\theta$ given the data. The standard estimator is the ordinary least squares (OLS) estimator:

\[\hat{\theta}^{\text{LS}} = \arg\min_{\theta \in \mathbb{R}^p} \| y - X\theta \|_2^2 ,\]

where $\|z\|_q = \left(\sum_{j=1}^m |z_j|^q\right)^{1/q}$ is the $\ell^q$-norm of a vector $z \in \mathbb{R}^m$. This is a theoretically and practically simple estimator, but OLS estimates typically suffer from the problem of overfitting. That is, OLS estimates typically have low or no bias, but high variance. These estimates are too specific to a particular data instance to provide good predictions. On the other hand, prediction error can actually be reduced by shrinking some of the coefficients $\theta$. Moreover, one might argue that anything which happens in the universe is affected in some small magnitude by everything else in the universe – but for modeling purposes, we should select only the few parameters which exhibit the most significant effects.

One way to resolve these problems is to set coefficients of $\theta$ to zero when they are sufficiently small, and to scale them down otherwise. This point of view leads to thresholding estimators, which is a useful characterization when $X$ is an orthonormal matrix.

However, to fit with the theme of relaxation, we will take an alternate but equivalent view. We would like to select $\theta$ with minimal residual squared error among a small subset of $\mathbb{R}^p$. The Lagrangian formulation of such an approach is penalized least-squares. That is, we would like to analyze estimates of the sort:

\[\hat{\theta}^{g} = \arg\min_{\theta \in \mathbb{R}^p} \| y - X\theta \|_2^2 + g(\theta),\]

where $g:\mathbb{R}^p \rightarrow \mathbb{R}$ is some sort of penalty on large, overfitting values of $\theta$.

The classical example of such a penalty is:

\[g(\theta) = \|\theta\|_2^2.\]

This penalty leads to a method known as ridge regression, which produces estimates of smaller $\ell^2$ norm and lower variance than OLS, at the price of some bias. Moreover, the optimization problem is still a convex (indeed, quadratic) program, and thus rather simple to solve. Unfortunately, this process will not set any coefficients to zero, so it does not produce the “low-dimensional” model we would like.

An alternate approach is to choose the “$\ell^0$ norm”⁴ as the penalty:

\[g(\theta) = \|\theta\|_0 = \# \{ i \in [p] : \theta_i \ne 0 \}\]

This penalty produces estimates with only a few non-zero coefficients, but it tends to be very unstable in that slight perturbations to the data lead to highly different selected models. Moreover, the optimization problem becomes one of combinatorial optimization since one must select a subset of $p$ possible dimensions. In fact, solving the penalized optimization problem with $\ell^0$ penalty is in general NP-hard, which one might expect since exhaustive search over all subsets of columns of $X$ has exponential complexity in $p$.

Given that $\ell^2$-penalization provides a tractable optimization problem which provides shrinkage but no selection, and $\ell^0$-penalization provides a difficult optimization problem which provides selection but no shrinkage, it is natural to ask what lies in between. To this end, set

\[g(\theta) = \|\theta\|_1.\]

This penalty produces the least absolute shrinkage and selection operator (lasso), developed by Tibshirani in 1996. Fortuitously, the $\ell^1$ penalty produces behavior which interpolates between the behavior of the $\ell^0$ and $\ell^2$ penalties: the lasso shrinks some coefficients and sets others to zero. Moreover, using the $\ell^1$ norm turns penalized least-squares into a convex optimization problem. That is, the lasso is the convex relaxation of the $\ell^0$-penalized estimator.

For a suggestion of why, consider the “unit balls” in the plane $\{x \in \mathbb{R}^2 : \|x\|_q \le 1 \}$ for various values of $q$. For $q=0$, the unit ball looks like a cross + (with arms extended out to infinity); clearly this is not convex. For $0 < q < 1$, the unit ball looks like a concave diamond ⟡. For $q=1$, the unit ball is convex and looks like a typical diamond ◆. In fact, $q=1$ is the smallest value of $q\ge 0$ for which the unit ball is a convex set, and correspondingly, the smallest value of $q$ for which the norm is a convex function.
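The failure of convexity for $q<1$ can be seen with a single midpoint. In this small sketch (the test points and the values of $q$ are my own choice), the points $(1,0)$ and $(0,1)$ lie in every unit ball, but their midpoint lies in the ball only when $q \ge 1$.

```python
def q_size(x, q):
    """sum_j |x_j|^q, which is at most 1 exactly when x is in the l^q unit ball (q > 0)."""
    return sum(abs(xj) ** q for xj in x)

def in_ball(x, q):
    return q_size(x, q) <= 1.0

a, b = (1.0, 0.0), (0.0, 1.0)
mid = (0.5, 0.5)  # midpoint of a and b

for q in (0.5, 1.0, 2.0):
    print(q, in_ball(a, q), in_ball(b, q), in_ball(mid, q))
```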

Geometrically, this “minimal” amount of convexity is quite important. The reason the $\ell^1$-norm still provides selection is due to the presence of corners. The penalized/constrained optimization problem of computing an estimator can be seen geometrically as finding the intersection of the norm ball and the contours of the squared error function; given that the corners of the $\ell^1$ ball are on the axes (representing few non-zero coefficients), such a penalization will provide a sparse estimator. Below is a figure from Tibshirani 1996.

Theorem. With high probability, the mean squared error of the lasso estimator is within a $\log p$ factor of the “best-possible” estimator (i.e., an estimator if we knew which elements were nonzero in advance).

For precise results on the performance of the lasso, see Candes, Tao 2007 and Bickel, Ritov, Tsybakov 2009.
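The shrinkage-versus-selection behavior of the three penalties is easiest to see in the simplified case where $X$ has orthonormal columns, $X^TX = I$ (which requires $n \ge p$, so this is a toy setting, not the high-dimensional one). Writing $z = X^Ty$ and scaling the penalty by a parameter $\lambda$ (my notation), the objective decouples into $\sum_j (z_j - \theta_j)^2 + \lambda g(\theta)$ up to a constant, and each estimator has a closed form. The sketch below, with made-up values of $z$ and $\lambda$, simply evaluates those closed forms.

```python
def ridge(z, lam):
    """Minimizes (z_j - t)^2 + lam * t^2 coordinate-wise: pure shrinkage."""
    return [zj / (1.0 + lam) for zj in z]

def best_subset(z, lam):
    """Minimizes (z_j - t)^2 + lam * 1{t != 0}: hard thresholding, no shrinkage."""
    return [zj if zj * zj > lam else 0.0 for zj in z]

def lasso(z, lam):
    """Minimizes (z_j - t)^2 + lam * |t|: soft thresholding at lam / 2."""
    out = []
    for zj in z:
        magnitude = max(abs(zj) - lam / 2.0, 0.0)
        out.append(magnitude if zj >= 0 else -magnitude)
    return out

z = [3.0, -0.4, 1.2, 0.1, -2.5]  # hypothetical OLS coefficients X^T y
lam = 1.0                        # hypothetical penalty level

print("ridge      ", ridge(z, lam))
print("best subset", best_subset(z, lam))
print("lasso      ", lasso(z, lam))
```

Ridge scales every coefficient but zeroes none, the $\ell^0$ penalty zeroes the small ones and leaves the rest untouched, and the lasso does a bit of both.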

Matrix Completion

This post has already gotten quite long, and there is fortunately a wealth of popular science literature on matrix completion, so I will keep this next section particularly vague and imprecise for the sake of brevity.

The matrix completion problem is commonly introduced in the context of the Netflix prize (hence the image for this post). Suppose we have an array where each column represents a film, each row represents a user, and each element is the user’s rating of a particular film. We might have many users, but each user is likely to rate only a small subset of all the films available. Netflix had a contest in the mid-2000s to develop a method to predict users’ ratings for films. This might seem quite hard, but it’s natural to assume that among several million users, there are perhaps only a few thousand “typical user” profiles which constitute the preferences of all the other users.

Mathematically, let $M\in \mathbb{R}^{n_1\times n_2}$. Suppose we only observe $M_{ij}$ for $(i,j) \in N \subset [n_1]\times [n_2]$. The problem is to reconstruct the rest of $M$ given these few observations. In general, this is impossible! But the “typical user” assumption in the context of the Netflix prize can be interpreted as a low rank assumption. That is, we wish to solve the optimization problem:

\[\begin{align*} \text{ min } & \text{rank}(X) \\ \text{ s.t. } & X \in \mathbb{R}^{n_1\times n_2}\\ & X_{ij} = M_{ij} \quad \forall (i,j) \in N \end{align*}\]

Once again, we have a combinatorial selection problem, this time due to the rank, and once again, we can attempt to solve the convexification:

\[\begin{align*} \text{ min } & \|X\|_* \\ \text{ s.t. } & X \in \mathbb{R}^{n_1\times n_2}\\ & X_{ij} = M_{ij} \quad \forall (i,j) \in N, \end{align*}\]

where $\|X\|_*$ is the nuclear norm of $X$, the sum of its singular values.

Theorem. Under some regularity conditions on the matrix $M$, and assuming we observe sufficiently many random entries of $M$, then with high probability, nuclear norm minimization will exactly recover $M$.

For precise results, see Candes, Tao 2004, Candes, Recht 2009 and Candes, Tao 2010.

The principle of convexification occurs in at least two other contexts, neither of which I will go into any detail on at all:

The Bethe approximation of the log partition function of a Gibbs measure is not necessarily convex, but a “convexified” version can give an upper bound, as analyzed by Wainwright, Jaakkola, Willsky 2005.
Convexification by applying the Legendre-Fenchel transform is a technique used in Calculus of Variations, an area of analysis focusing on infinite-dimensional optimization problems.

For a broad view of convex relaxation and associated computational and statistical problems, see Chandrasekaran, Jordan 2013.

Indeed, even loading an arbitrary real number into a computer would take infinite time and memory! ↩
Naive intuition would suggest the opposite: that by restricting to $\mathbb{Z}$, there are fewer $x$ to “pick from” as compared with searching through $\mathbb{R}$, so it should be easier to find an optimal solution. Unfortunately, the additional constraint tends to make the optimization more difficult. The key is to remember that the structure which made LP tractable is removed with the addition of an integer constraint. ↩
This is the general setup of linear regression, but for more general problems, we can fix a dictionary of functions $\varphi_1,\cdots, \varphi_{p^*} : \mathbb{R}^p \rightarrow \mathbb{R}$, and solve the linear regression problem in $p^*$ dimensions. ↩
Note that this is not a norm due to the lack of positive homogeneity. However, it is graced with such a name since $\lim_{q\rightarrow 0} \|\theta\|_q^q = \|\theta\|_0$. ↩

Relaxations in Optimization was originally published by Steven Soojin Kim at Steven Soojin Kim on 2014.08.31.

Poincaré Inequalities

2014-07-23T00:00:00-04:00

The (classical) Poincaré inequality
1. The Poincaré constant
Probabilistic Poincaré inequality
Gaussian Poincaré inequality
1. Proof by Efron-Stein inequality
2. Proof by Markov semigroups
Other Poincaré inequalities

The (classical) Poincaré inequality

In functional analysis, Sobolev inequalities and Morrey’s inequalities are a collection of useful estimates which quantify the tradeoff between integrability and smoothness. The ability to compare such properties is particularly useful when studying regularity of PDEs, or when attempting to show boundedness in a particular space in order to apply the direct method in the calculus of variations.

The Poincaré inequality is an example of this kind of estimate. Let $1\le p \le \infty$ and $U$ a bounded, connected, open subset of $\mathbb{R}^n$, with $C^1$ boundary. For $f:U\rightarrow\mathbb{R}$, denote by $(f)_U = \frac{1}{\vert U\vert} \int_U f(x)\,dx$ the average of $f$ over $U$.

Theorem. There exists a constant $c=c(n,p,U)$ such that
\[\lVert f - (f)_U\rVert_{L^p(U)} \le c \, \lVert\nabla f\rVert_{L^p(U)},\]
for all $f \in W^{1,p}(U)$.

A simple case is when $n=1$, $p=2$, $f \in C^1$, and $U=[-r,r]$. Using the intermediate value theorem, the fundamental theorem of calculus and Hölder’s inequality, one can easily prove the result with the constant $c= 2r$. For a proof of the general result, see Evans §5.8.1.
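For completeness, here is a sketch of that one-dimensional argument (my own reconstruction, not necessarily the route taken in the reference). Since $f$ is continuous, the intermediate value theorem gives a point $x_0 \in U$ with $f(x_0) = (f)_U$. Then for every $x\in U$, the fundamental theorem of calculus and Hölder’s inequality give

\[\vert f(x) - (f)_U \vert = \left\vert \int_{x_0}^x f'(t)\,dt \right\vert \le \int_{-r}^r \vert f'(t)\vert \,dt \le \sqrt{2r}\, \lVert f'\rVert_{L^2(U)}.\]

Squaring and integrating over $U$, which has length $2r$,

\[\lVert f - (f)_U\rVert_{L^2(U)}^2 \le 2r \cdot 2r\, \lVert f'\rVert_{L^2(U)}^2,\]

and taking square roots yields the inequality with $c = 2r$.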

One way to interpret the Poincaré inequality is as an isoperimetric inequality applied to the level sets of $f$. That is, just as one can control the area of a set via its perimeter (for example, a circle is the unique maximizer of area for a given perimeter), one can control the norm of $f$ via the norm of $\nabla f$. For a discussion of how one might obtain the Poincaré inequality from isoperimetry, see Nick Alger’s blog. To see how to recover the isoperimetric inequality from the Poincaré inequality, see Peter Luthy’s blog post.

The Poincaré constant

Aside from the trivial $n=1$ case, what can we say about the constant $c$? When can we recover the optimal constant? Consider the case $p=2$, so the right-hand side of the Poincaré inequality is the Dirichlet energy of $f$. The min-max principle says that the first eigenvalue of the negative Laplacian on $H_0^1(U)$ minimizes the Rayleigh quotient. That is, where $0 < \lambda_1 \le \lambda_2 \le \cdots$ are the eigenvalues of $-\Delta$,

\[\lambda_1 = \inf_{f \ne 0} \frac{\int_U \vert \nabla f\vert^2 dx}{\int_U \vert f \vert^2 dx}.\]

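This recovers a Poincaré-type inequality for $f \in H_0^1(U)$, namely $\int_U \vert f\vert^2 dx \le \lambda_1^{-1} \int_U \vert \nabla f\vert^2 dx$, with (optimal) constant $\lambda_1^{-1}$ in this squared form (equivalently, $c = \lambda_1^{-1/2}$ for the norms themselves).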

Note that $\lambda_1$ is related to the Cheeger constant, so we have yet another way to make the connection to isoperimetry. For an exposition of this connection in the (relatively) simple case of graphs, see §2.3 of Chung’s book on Spectral Graph Theory.

Probabilistic Poincaré inequality

What if we want to prove a Poincaré-type inequality, except we would like to integrate over a measure $\mu$ other than Lebesgue measure? In the PDE and numerical analysis literature, one can find references to “weighted Poincaré inequalities”, where $d\mu(x)=w(x)dx$ is absolutely continuous with respect to Lebesgue measure with density (or “weight”) $w$ on some bounded domain. Typically, these estimates follow from analytical methods.

Instead, let’s try a probabilistic approach. Let $(M,d)$ be a metric space, $\mu$ a probability measure on its Borel sets. Consider the case $p=2$, so that the squared $L^2(\mu)$ norm of the centered function $f - \mathbb{E}_\mu f$ is just the variance (of the random variable $f$). For some class of measurable functions $f$ on $(M,d)$, we’d like to show that there exists a constant $C$ such that

\[\begin{equation} \label{mupoinc} \textrm{Var}_\mu (f) \le C\,\mathbb{E}_\mu \vert \nabla f \vert^2 . \end{equation}\]

This inequality claims that the fluctuations of $f$ are controlled by how quickly $f$ can change. In particular, if $f$ is Lipschitz, then the variance is bounded by a constant! Analogous to the PDE case, “regularity” or “smoothness” (via the first derivative) gives an energy bound which restricts the possible behaviors of the system/function $f$.

Bounds on variance can be manipulated to develop concentration inequalities which bound the probability of $f$ deviating too far from its mean/median. Such estimates are a fundamental tool in statistics and machine learning (e.g., PAC learning, VC theory).

Gaussian Poincaré inequality

Theorem (GPI). Let $\mu$ be the standard Gaussian measure on $M=\mathbb{R}^n$. Assume $f:\mathbb{R}^n\rightarrow\mathbb{R}$ is $C^1$. Then, \eqref{mupoinc} holds with $C=1$.

Note that this inequality is tight! Consider $f(x) = x_1 + x_2 + \cdots + x_n$ and note that $\mu$ is a product measure so there is no covariance between cross terms.
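A quick Monte Carlo experiment makes both the inequality and its tightness visible. This is only a numerical sketch: the dimension, sample size, seed, and the second test function $f(x) = \sin(x_1)$ are my own choices. For the sine, the exact values are $\textrm{Var}\, f = (1-e^{-2})/2$ and $\mathbb{E}\vert\nabla f\vert^2 = (1+e^{-2})/2$, since $\mathbb{E}\cos(2X_1) = e^{-2}$.

```python
import math
import random

def monte_carlo(f, grad_f, n, samples, rng):
    """Estimate (Var f(X), E |grad f(X)|^2) for X standard Gaussian in R^n."""
    values, energies = [], []
    for _ in range(samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        values.append(f(x))
        energies.append(sum(g * g for g in grad_f(x)))
    mean = sum(values) / samples
    variance = sum((v - mean) ** 2 for v in values) / samples
    energy = sum(energies) / samples
    return variance, energy

n, samples = 3, 100000
rng = random.Random(0)

# Linear function: the inequality holds with equality (both sides equal n).
var_lin, energy_lin = monte_carlo(sum, lambda x: [1.0] * len(x), n, samples, rng)

# A nonlinear function: strict inequality.
var_sin, energy_sin = monte_carlo(
    lambda x: math.sin(x[0]),
    lambda x: [math.cos(x[0])] + [0.0] * (len(x) - 1),
    n, samples, rng)

print("linear:", var_lin, energy_lin)
print("sine:  ", var_sin, energy_sin)
```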

Applications:

Consider the $(1+1)$-dimensional Gaussian polymer model. The variance of the ground state energy (the minimum of sums of Gaussians) is bounded by $n+1$, which is surprising given that the expected size grows linearly in $n$.
Consider the Sherrington-Kirkpatrick model with inverse temperature $\beta$. The variance of its free energy (the normalized log partition function) is bounded by $C(\beta)n$.
Let $(g_1,\cdots,g_n)$ be jointly Gaussian (possibly correlated). Then the variance of the maximum of $(g_i)$ is bounded by the maximum of the variances.

For these examples and others, the bounds produced by Gaussian Poincaré inequality are known to be suboptimal. For more on when/why this occurs, take a look at these notes on superconcentration.
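The last application is a convenient place to see the suboptimality numerically. For $n$ independent standard Gaussians the bound says the variance of the maximum is at most $1$ for every $n$, whereas the true variance is known to decay on the order of $1/\log n$. The following simulation is a rough sketch; the values of $n$, the sample size, and the seed are arbitrary.

```python
import random

def variance_of_max(n, samples, rng):
    """Monte Carlo estimate of Var(max(g_1, ..., g_n)) for i.i.d. standard Gaussians."""
    maxima = [max(rng.gauss(0.0, 1.0) for _ in range(n)) for _ in range(samples)]
    mean = sum(maxima) / samples
    return sum((m - mean) ** 2 for m in maxima) / samples

rng = random.Random(0)
estimates = {n: variance_of_max(n, 10000, rng) for n in (1, 10, 100)}
for n, v in estimates.items():
    print(n, v)
```

The estimate for $n=1$ sits near the bound, while the estimates for larger $n$ fall well below it.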

Proof by Efron-Stein inequality

We first prove the Efron-Stein inequality, which will allow us to analyze the variability of $f$ in a coordinate-wise manner. Extending a result from one dimension to arbitrary dimension is an example of “tensorization”.

Let $X_1,\cdots,X_n$ be independent random variables such that $X_k$ takes values in some space $\mathcal{X}_k$, and $f:\prod_k \mathcal{X}_k \rightarrow\mathbb{R}$ a measurable function. Let $\mathbb{E}_i$ denote the expectation with respect to the $i$-th coordinate; that is,

\[\mathbb{E}_i f(X) = \mathbb{E}[f(X) \vert X_1,\cdots, X_{i-1}, X_{i+1}, \cdots, X_n],\]

and let $\textrm{Var}_i$ denote the conditional variance,

\[\textrm{Var}_i f(X) = \mathbb{E}_i \left[ (f(X) - \mathbb{E}_i f(X))^2 \right] .\]

Theorem.
\[\textrm{Var} \,f(X) \le \mathbb{E} \sum_{i=1}^n \textrm{Var}_i f(X)\]

Proof. First note that if $f(x) = \sum_{i=1}^n x_i$, we have an exact equality since $X_i - \mathbb{E} X_i$ are orthogonal in $L^2$. More generally, we’d like to bound the variance of $f$ by expressing $f(X) - \mathbb{E} f(X)$ as the sum of martingale differences, and somehow exploit the orthogonality of those differences.

Let $Z = f(X)$, $Y = Z - \mathbb{E}Z$, and

\[Y_i = \mathbb{E}_{(i+1):n} Z - \mathbb{E}_{i:n} Z\]

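for $i=1,\cdots, n$, where $\mathbb{E}_{i:n}$ denotes integrating out the coordinates $X_i,\cdots,X_n$ (so $\mathbb{E}_{(n+1):n} Z = Z$ and $\mathbb{E}_{1:n} Z = \mathbb{E} Z$). These differences telescope to $Y = \sum_{i=1}^n Y_i$, so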

\[\textrm{Var} Z = \mathbb{E}\,Y^2 = \sum_{i=1}^n \mathbb{E}\, Y_i^2 + \sum_{i \ne j} \mathbb{E} Y_i Y_j\]

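The cross terms vanish: for $i<j$, $Y_i$ depends only on $X_1,\cdots,X_i$, while $\mathbb{E}[Y_j \vert X_1,\cdots,X_{j-1}] = 0$ by the tower property, so $\mathbb{E}\, Y_i Y_j = 0$. As for the first sum, independence of the coordinates gives $\mathbb{E}_{i:n} Z = \mathbb{E}_{(i+1):n} \mathbb{E}_i Z$, so by Jensen’s inequality,

\[Y_i^2 = ( \mathbb{E}_{(i+1):n} ( Z -\mathbb{E}_{i} Z) )^2 \le \mathbb{E}_{(i+1):n} [ (Z - \mathbb{E}_{i} Z)^2]\]

Taking expectations gives $\mathbb{E}\, Y_i^2 \le \mathbb{E}\, (Z - \mathbb{E}_i Z)^2 = \mathbb{E}\, \textrm{Var}_i f(X)$, and summing over $i$ yields the claim.

$ \square$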
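As a sanity check on the Efron-Stein inequality, here is a minimal sketch that computes both sides exactly for independent Bernoulli coordinates by enumerating all outcomes. The function name `efron_stein_terms` and the choice of Bernoulli coordinates are assumptions made for illustration only; nothing in the inequality requires them.

```python
import itertools
import numpy as np

def efron_stein_terms(f, probs):
    """Return (Var f(X), E sum_i Var_i f(X)) for independent X_i ~ Bernoulli(probs[i]).

    Both quantities are computed exactly by enumerating all 2^n outcomes,
    so this is only meant for small n.
    """
    n = len(probs)
    outcomes = list(itertools.product([0, 1], repeat=n))

    def weight(x):
        # Probability of the outcome x under the product measure.
        return float(np.prod([p if xi == 1 else 1.0 - p for xi, p in zip(x, probs)]))

    mean = sum(weight(x) * f(x) for x in outcomes)
    variance = sum(weight(x) * (f(x) - mean) ** 2 for x in outcomes)

    bound = 0.0
    for x in outcomes:
        for i, p in enumerate(probs):
            x0 = x[:i] + (0,) + x[i + 1:]
            x1 = x[:i] + (1,) + x[i + 1:]
            # Variance in coordinate i alone (a two-point distribution),
            # with the other coordinates frozen at their values in x.
            bound += weight(x) * p * (1.0 - p) * (f(x1) - f(x0)) ** 2
    return variance, bound

probs = [0.2, 0.5, 0.7]
var_max, bound_max = efron_stein_terms(max, probs)   # a nonlinear example
var_sum, bound_sum = efron_stein_terms(sum, probs)   # the linear (equality) case
print(var_max, bound_max)
print(var_sum, bound_sum)
```

For `sum` the two numbers coincide, matching the remark at the start of the proof; for `max` the bound is strict.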

Proof of GPI.

Let $X \sim \mu$, so that $\textrm{Var}_\mu(f) = \textrm{Var} f(X)$. Assume $\mathbb{E} \lvert \nabla f(X) \rvert^2 < \infty$, since otherwise the result holds trivially. It is sufficient to consider the case $n=1$: the Efron-Stein inequality bounds the variance by the sum of the coordinate-wise variances, and each of those is a one-dimensional problem. Also, suppose $f$ has compact support and is $C^2$; otherwise, just approximate.

The key insight here is that a Gaussian random variable $X$ is just the scaling limit of a sum of mean 0 variance 1 random variables. Thus, to study the variance of a function of $X$, we can try to study the variance of a function of finite sums. To this end, let $\epsilon_1,\cdots,\epsilon_n$ be independent Rademacher random variables, and let $S_n = \frac{1}{\sqrt{n}} \sum_{j=1}^n \epsilon_j$. For all $i$,

\[\textrm{Var}_i f(S_n) = \tfrac{1}{4} \left[ f( S_n - \tfrac{\epsilon_i}{\sqrt{n}} + \tfrac{1}{\sqrt{n}} ) - f( S_n- \tfrac{\epsilon_i}{\sqrt{n}} - \tfrac{1}{\sqrt{n}} )\right]^2.\]

Apply Efron-Stein to get

\[\textrm{Var} f(S_n) \le \tfrac{1}{4}\sum_{i=1}^n \mathbb{E}\left[ f( S_n - \tfrac{\epsilon_i}{\sqrt{n}} + \tfrac{1}{\sqrt{n}} ) - f( S_n- \tfrac{\epsilon_i}{\sqrt{n}} - \tfrac{1}{\sqrt{n}} )\right]^2 .\]

The central limit theorem says $S_n \Rightarrow N(0,1)$, so $\textrm{Var} f(S_n) \rightarrow \textrm{Var} f(X)$ (recall that $f$ is bounded and continuous). Let $K=\sup_x \vert f''(x)\vert$. One of the two arguments below equals $S_n$ and the other lies at distance $2/\sqrt{n}$ from it, so by Taylor’s theorem,

\[\left\vert f( S_n - \tfrac{\epsilon_i}{\sqrt{n}} + \tfrac{1}{\sqrt{n}} ) - f( S_n - \tfrac{\epsilon_i}{\sqrt{n}} - \tfrac{1}{\sqrt{n}} ) \right\vert \le \tfrac{2}{\sqrt{n}} \vert f'(S_n)\vert + \tfrac{2K}{n} .\]

Then apply the CLT again,

\[\limsup_{n\rightarrow\infty}\tfrac{1}{4}\sum_{i=1}^n \mathbb{E}\left[ \left( f( S_n - \tfrac{\epsilon_i}{\sqrt{n}} + \tfrac{1}{\sqrt{n}} ) - f( S_n- \tfrac{\epsilon_i}{\sqrt{n}} - \tfrac{1}{\sqrt{n}} ) \right)^2\right] \le \mathbb{E}[f'(X)^2] .\]

$\square$
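The Rademacher approximation in this proof can be checked numerically. The sketch below computes $\textrm{Var} f(S_n)$ and the Efron-Stein right-hand side exactly from binomial probabilities, using that $S_n - \epsilon_i/\sqrt{n}$ is a normalized sum of the other $n-1$ signs. The helper names and the test function $f = \sin$ are my own choices for illustration; $\sin$ is not compactly supported, but it is bounded with bounded derivatives, which is all the argument uses. For this $f$, the limits are $\textrm{Var} \sin(X) = (1-e^{-2})/2$ and $\mathbb{E} \cos^2(X) = (1+e^{-2})/2$.

```python
import math

def binomial_pmf(m):
    """Probabilities of Binomial(m, 1/2), i.e. the number of +1 signs among m."""
    return [math.comb(m, k) / 2**m for k in range(m + 1)]

def variance_of_f_Sn(f, n):
    """Exact Var f(S_n), where S_n = (number of +1 minus number of -1) / sqrt(n)."""
    pmf = binomial_pmf(n)
    values = [f((2 * k - n) / math.sqrt(n)) for k in range(n + 1)]
    mean = sum(p * v for p, v in zip(pmf, values))
    return sum(p * (v - mean) ** 2 for p, v in zip(pmf, values))

def efron_stein_rhs(f, n):
    """Exact value of (1/4) sum_i E[f(T_i + h) - f(T_i - h)]^2 with h = 1/sqrt(n).

    T_i = S_n - eps_i / sqrt(n) has the same law for every i, so the sum over i
    is n times a single expectation over Binomial(n - 1, 1/2).
    """
    h = 1.0 / math.sqrt(n)
    total = 0.0
    for k, p in enumerate(binomial_pmf(n - 1)):
        t = (2 * k - (n - 1)) * h
        total += p * (f(t + h) - f(t - h)) ** 2
    return n / 4 * total

for n in (10, 100, 400):
    print(n, variance_of_f_Sn(math.sin, n), efron_stein_rhs(math.sin, n))
```

As $n$ grows, the first column should approach the Gaussian variance and the second the Gaussian gradient energy, with the first staying below the second throughout.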

This proof (along with several related results) can be found in the book by Boucheron, Lugosi, Massart.

Proof by Markov semigroups

For a proof of a rather different flavor, we take a more dynamical view, and think of the Gaussian measure $\mu$ as the invariant measure of some system.

Let $(X_t)$ be a Markov process in some state space, $P_t$ its semigroup, $L$ its generator, $f$ an element of some appropriate domain of test functions. Suppose $(X_t)$ has an invariant measure $\mu$. Note that $\mu$ defines a natural $L^2$ space and an inner product, but we can define another bilinear form $\mathcal{E}$, the Dirichlet form:

\[\mathcal{E}(f,g) := -(f,Lg) = -\int f Lg \, d\mu.\]

If $L$ is self-adjoint (with respect to the $L^2(\mu)$ inner product), the Dirichlet form is symmetric. Note that $L$ is self-adjoint when the Markov process $(X_t)$ is reversible. A related bilinear form is the covariance,

\[\mathrm{Cov}_\mu(f,g) := \int fg\,d\mu - \int f\,d\mu \, \int g\,d\mu\]

Covariance Lemma. Let $f,g\in L^2(\mu)$. Suppose when differentiating $(f,P_tg)$ with respect to $t$, we can move the derivative inside. Also, assume the “heat equation” $\partial_t P_t g = L P_t g$ holds. Then,
\[\mathrm{Cov}_\mu(f,g) = \int_0^\infty \mathcal{E}(f,P_tg)\,dt .\]

Proof. Note that $P_tg$ tends to $\mathbb{E}_\mu g$ in $L^2$ as $t\rightarrow\infty$. Thus,

\[\begin{align*} \mathrm{Cov}_\mu(f,g) &= (f,g) - (f,\mathbb{E}_\mu g)\\ &= \lim_{t\rightarrow\infty}\left[ (f,P_0g) - (f,P_tg)\right] \\ &= - \int_0^\infty \partial_t (f,P_tg)\, dt\\ &= - \int_0^\infty (f,\partial_t P_tg) \,dt\\ &= - \int_0^\infty (f,L P_tg)\,dt \\ &= \int_0^\infty \mathcal{E}(f,P_tg)\,dt \end{align*}\]

$\square$

What does $\mathcal{E}$ look like? Consider gradient diffusions. That is, for some potential $V$,

\[Lf = \nabla V \cdot \nabla f + \Delta f\]

One can show that this is the generator for the diffusion $dX_t = \nabla V(X_t) dt + \sqrt{2} dB_t$, and has invariant measure $\gamma(dx) = e^{V(x)}\,dx$. That is, $\int Lf \, e^V \,dx = 0$ for all $f$, or $L^\star e^V =0$. Using integration by parts (assuming appropriate boundary conditions on $f,g$),

\[\begin{align*} \mathcal{E}(f,g) &= - (f, Lg)_\gamma\\ &= -(f, \nabla V \cdot \nabla g + \Delta g)_\gamma\\ &= -(f,\nabla V \cdot \nabla g)_\gamma + (\nabla f + (\nabla V)f, \nabla g)_\gamma\\ &= -(f,\nabla V \cdot \nabla g)_\gamma + (f, \nabla V \cdot \nabla g)_\gamma + (\nabla f , \nabla g)_\gamma\\ &= (\nabla f, \nabla g)_\gamma \end{align*}\]
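The identity $\mathcal{E}(f,g) = (\nabla f, \nabla g)_\gamma$ is easy to test numerically in one dimension. The sketch below takes $V(x) = -x^2/2$, so that $\gamma$ is (up to normalization) the standard Gaussian, and evaluates both sides with Gauss-Hermite quadrature. The test functions $f(x) = e^{x/2}$ and $g(x) = x^2$ are arbitrary choices of mine; for them both sides equal $\tfrac{1}{2}e^{1/8}$.

```python
import numpy as np

# Gauss-Hermite nodes and weights for the weight exp(-x^2/2),
# rescaled so that the weights sum to one (standard Gaussian measure).
NODES, WEIGHTS = np.polynomial.hermite_e.hermegauss(60)
WEIGHTS = WEIGHTS / np.sqrt(2 * np.pi)

def gaussian_mean(values):
    """Approximate E[h(X)], X ~ N(0,1), from the values h(NODES)."""
    return float(np.sum(WEIGHTS * values))

x = NODES
f, df = np.exp(x / 2), 0.5 * np.exp(x / 2)   # f(x) = exp(x/2) and its derivative
dg, d2g = 2 * x, 2 * np.ones_like(x)         # derivatives of g(x) = x^2
Lg = -x * dg + d2g                           # L g = V' g' + g'' with V(x) = -x^2/2

dirichlet_via_generator = -gaussian_mean(f * Lg)
dirichlet_via_gradients = gaussian_mean(df * dg)
print(dirichlet_via_generator, dirichlet_via_gradients)
```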

Proof of GPI

Consider the example of the OU operator, $L= -x \cdot \nabla + \Delta$, which is the gradient diffusion generator above with $V(x) = -\vert x\vert^2/2$. Then, the stationary distribution is (after normalization) the standard Gaussian $\mu$. Note that we can explicitly write the semigroup as $P_t f(x) = \mathbb{E}[f(e^{-t}x + \sqrt{1-e^{-2t}}Z)]$ with $Z \sim \mu$, so $\nabla P_t f = e^{-t} P_t \nabla f$. By the covariance lemma,

\[\begin{align*} \textrm{Var}_{\mu} f &= \int_0^\infty \mathcal{E}(f,P_tf) \,dt\\ &= \int_0^\infty (\nabla f , \nabla P_t f)_\mu\,dt \\ &= \int_0^\infty e^{-t} \mathbb{E}_{\mu} (\nabla f \cdot P_t \nabla f)\, dt\\ (\text{Cauchy-Schwarz}) &\le \int_0^\infty e^{-t} \mathbb{E}_{\mu} \vert\nabla f\vert\, \vert P_t \nabla f\vert\, dt\\ (\text{Hölder}) &\le \int_0^\infty e^{-t} \left(\mathbb{E}_{\mu} \vert\nabla f\vert^2 \, \mathbb{E}_{\mu} \vert P_t \nabla f\vert^2\right)^{1/2}dt \\ (\text{Jensen on } P_t) &\le \int_0^\infty e^{-t} \left(\mathbb{E}_{\mu} \vert\nabla f\vert^2 \, \mathbb{E}_{\mu} P_t \vert\nabla f\vert^2\right)^{1/2}dt\\ &=\mathbb{E}_{\mu}\vert\nabla f\vert^2 \int_0^\infty e^{-t} dt \\ &= \mathbb{E}_{\mu}\vert\nabla f\vert^2 \end{align*}\]

$\square$
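The one ingredient specific to the OU process is the commutation relation $\nabla P_t f = e^{-t} P_t \nabla f$. Here is a rough one-dimensional check, assuming the explicit formula for $P_t$ given above and approximating the Gaussian expectation by Gauss-Hermite quadrature. The names `ou_semigroup` and `commutation_gap` are hypothetical, and the left-hand derivative is taken by a central finite difference.

```python
import numpy as np

# Quadrature rule for the standard Gaussian (weights rescaled to sum to one).
NODES, WEIGHTS = np.polynomial.hermite_e.hermegauss(60)
WEIGHTS = WEIGHTS / np.sqrt(2 * np.pi)

def ou_semigroup(f, t, x):
    """P_t f(x) = E[f(e^{-t} x + sqrt(1 - e^{-2t}) Z)] with Z ~ N(0,1)."""
    a = np.exp(-t)
    return float(np.sum(WEIGHTS * f(a * x + np.sqrt(1.0 - a**2) * NODES)))

def commutation_gap(f, df, t, x, h=1e-5):
    """|d/dx P_t f(x) - e^{-t} P_t f'(x)|, with the derivative taken numerically."""
    lhs = (ou_semigroup(f, t, x + h) - ou_semigroup(f, t, x - h)) / (2 * h)
    rhs = np.exp(-t) * ou_semigroup(df, t, x)
    return abs(lhs - rhs)

print(commutation_gap(np.sin, np.cos, t=0.7, x=0.3))
```

The printed gap should be at the level of finite-difference error. For $f = \sin$ one can also compare against the closed form $P_t \sin(x) = e^{-(1-e^{-2t})/2} \sin(e^{-t}x)$.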

Other Poincaré inequalities

Earlier, we claimed that we want to prove \eqref{mupoinc} for certain measures $\mu$. The GPI is one example, but it could be more useful to think of it as a special case of an inequality like

\begin{equation} \label{genpoinc} \textrm{Var}_\mu f \le C \, \mathcal{E}(f,f) \end{equation} where $\mathcal{E}$ is the Dirichlet form associated with some Markov process with invariant measure $\mu$.

For example, in the case of finite-state Markov chains, one can analyze the constant $C$ using the canonical paths method. For Poincaré inequalities on Markov random fields, see Wu AoP 2006.

It is possible to prove that $-L$ is positive semi-definite, so it is reasonable to ask whether its spectrum $\lambda_0 \le \lambda_1 \le \cdots $ encodes nice properties. Note that constant functions are in the null space of $L$, and in fact one can prove that the eigenspace of $\lambda_0= 0$ is one-dimensional. Using the Plancherel identity in $L^2(\mu)$, one can show that a Poincaré inequality \eqref{genpoinc} holds iff $\lambda_1$ is strictly positive, in which case the optimal constant is $C = \frac{1}{\lambda_1}$.
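For a finite-state reversible chain all of this is plain linear algebra, which makes for an easy experiment. The sketch below builds a generator from symmetric "conductances" $c_{ij}$ and a target distribution $\pi$ via the rates $q_{ij} = c_{ij}/\pi_i$ (one standard way to get reversibility; the construction and all function names are my own illustration), computes $\lambda_1$ from the symmetrized matrix, and exposes the two sides of \eqref{genpoinc}.

```python
import numpy as np

def reversible_generator(conductance, pi):
    """Generator with rates q_ij = c_ij / pi_i; reversible w.r.t. pi when c is symmetric."""
    Q = conductance / pi[:, None]
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))   # rows of a generator sum to zero
    return Q

def spectral_gap(Q, pi):
    """Second-smallest eigenvalue of -L, via the symmetric matrix D^{1/2} Q D^{-1/2}."""
    d = np.sqrt(pi)
    S = d[:, None] * Q / d[None, :]
    eigenvalues = np.sort(np.linalg.eigvalsh(-(S + S.T) / 2))
    return float(eigenvalues[1])

def variance(f, pi):
    mean = pi @ f
    return float(pi @ (f - mean) ** 2)

def dirichlet_form(f, g, Q, pi):
    """E(f, g) = -(f, L g) in L^2(pi)."""
    return float(-(pi * f) @ (Q @ g))

rng = np.random.default_rng(0)
n = 5
A = rng.random((n, n))
pi = rng.random(n)
pi /= pi.sum()
Q = reversible_generator((A + A.T) / 2, pi)
gap = spectral_gap(Q, pi)
f = rng.standard_normal(n)
print(variance(f, pi), dirichlet_form(f, f, Q, pi) / gap)
```

The first printed number should never exceed the second, whatever `f` is.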

The persistent appearance of the spectral gap $\lambda_1$ is suggestive of the deep connection between concentration and ergodicity. In some sense, both notions encode some sort of rigidity of a system. For example,

Theorem. Let $P_t$ be a Markov semigroup with generator $L$ and stationary measure $\mu$, and let $C> 0$. The following are equivalent, where each statement is understood to hold for all $f$ in the domain of $L$:

1. (Poincaré inequality) $\mu$ satisfies \eqref{genpoinc} with constant $C$.

2. (Exponential decay) For all $t \ge 0$,

\[\left\lVert P_tf - \mathbb{E}_\mu f \right \rVert_{L^2(\mu)} \le e^{- t/C} \lVert f - \mathbb{E}_\mu f\rVert_{L^2(\mu)}.\]

Proof.

Assume 1. Then,

\[\frac{d}{dt}\textrm{Var}_\mu (P_tf ) = -2\mathcal{E}(P_tf,P_tf) \le -\frac{2}{C} \textrm{Var}_\mu (P_tf)\]

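where the equality comes from the heat equation $\partial_t P_t f = L P_t f$, the definition of the Dirichlet form, and the stationarity of $\mu$ (which gives $\mathbb{E}_\mu P_tf = \mathbb{E}_\mu f$ for all $t$), and the inequality is the Poincaré inequality applied to $P_tf$. Integrating this differential inequality gives $\textrm{Var}_\mu (P_t f) \le e^{-2t/C} \textrm{Var}_\mu f$, which is the exponential decay statement after taking square roots.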

Assume 2. Then, using the equality from the previous part again at time $t=0$,

\[2 \mathcal{E}(f,f) = - \lim_{t\downarrow 0} \frac{\textrm{Var}_\mu(P_tf) - \textrm{Var}_\mu f }{t} \ge \textrm{Var}_\mu f \, \lim_{t\downarrow 0} \frac{1 - e^{-2t/C}}{t} = \frac{2}{C} \textrm{Var}_\mu f\]

$\square$.
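To see the exponential decay in action, here is a small experiment on a three-state birth-death chain. The rates are made up for illustration, and the helper names are hypothetical. The semigroup $P_t = e^{tL}$ is computed by diagonalizing the generator, which is fine here because the eigenvalues are real and distinct; the decay rate used is $1/C = \lambda_1$.

```python
import numpy as np

# Hypothetical birth-death generator on {0, 1, 2}; rows sum to zero.
Q = np.array([[-1.0,  1.0,  0.0],
              [ 2.0, -3.0,  1.0],
              [ 0.0,  2.0, -2.0]])

def stationary_distribution(Q):
    """Solve pi Q = 0 together with sum(pi) = 1 by least squares."""
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def semigroup(Q, t):
    """P_t = exp(tQ), computed from the eigendecomposition of Q."""
    w, V = np.linalg.eig(Q)
    return (V @ np.diag(np.exp(t * w)) @ np.linalg.inv(V)).real

def l2_distance_to_mean(f, pi):
    """|| f - E_pi f || in L^2(pi)."""
    return float(np.sqrt(pi @ (f - pi @ f) ** 2))

pi = stationary_distribution(Q)
gap = np.sort(-np.linalg.eigvals(Q).real)[1]   # lambda_1, the spectral gap
f = np.array([1.0, -2.0, 0.5])
for t in (0.1, 0.5, 2.0):
    lhs = l2_distance_to_mean(semigroup(Q, t) @ f, pi)
    rhs = np.exp(-gap * t) * l2_distance_to_mean(f, pi)
    print(t, lhs, rhs)
```

In each printed row the middle number should be at most the last one.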

For more on this connection, see Liggett AoP 1989. For a more recent paper on this interplay, see Röckner, Wang JFA 2001 or Bakry, Cattiaux, Guillin JFA 2008.

A great overall reference is the recent book by Bakry, Gentil, Ledoux.

Poincaré Inequalities was originally published by Steven Soojin Kim at Steven Soojin Kim on 2014.07.23.

Welcome!

2014-07-22T00:00:00-04:00

Hello! I plan to use this blog as a little cache where I can tuck away things which I find interesting. This might include ideas of my own, summaries of papers, surveys of classical material, or just cool tidbits I learn from others. I’m sure the direction and tone will develop over time.

For the curious (and as a reminder to myself), I built this blog using Jekyll and the So Simple Theme. I’m hosted on Github Pages. The math on this site is rendered through MathJax. The domain .im is the ccTLD for the Isle of Man.

Theorem. Blogging is cool.

Proof: Trivial.

Welcome! was originally published by Steven Soojin Kim at Steven Soojin Kim on 2014.07.22.
