# Category:Problems

## Sparse approximation problems

### Best subset selection

This problem is NP-hard in general. Many of the algorithms presented on this wiki attempt to solve it approximately, or to solve it exactly in certain special cases. We write $\|x\|_0$ for the $\ell_0$ quasi-norm, the number of nonzero entries of a vector.

$\min_x \|x\|_0 \quad\text{subject to}\quad Ax=b$.

There are also variants that account for noise, or that flip the objective and constraint:

$\min_x \frac{1}{2}\|Ax-b\|_2^2 \quad\text{subject to}\quad \|x\|_0 \le k$.
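To illustrate the combinatorial nature of the problem, here is a brute-force sketch of the constrained variant above, enumerating all supports of size at most k (the helper name `best_subset` is ours, not a standard routine):

```python
import numpy as np
from itertools import combinations

def best_subset(A, b, k):
    """Brute-force search over all supports of size <= k (exponential cost)."""
    M, N = A.shape
    best_x = np.zeros(N)
    best_err = 0.5 * np.linalg.norm(b) ** 2       # empty support
    for size in range(1, k + 1):
        for S in combinations(range(N), size):
            cols = list(S)
            # least-squares fit restricted to the candidate support
            xs, *_ = np.linalg.lstsq(A[:, cols], b, rcond=None)
            err = 0.5 * np.linalg.norm(A[:, cols] @ xs - b) ** 2
            if err < best_err:
                best_err = err
                best_x = np.zeros(N)
                best_x[cols] = xs
    return best_x
```

The number of candidate supports grows as $\binom{N}{k}$, which is exactly why the approximate algorithms on this wiki exist.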

### Basis Pursuit (BP)

This is one of the best-known sparse recovery problems. The system Ax = b is usually underdetermined or ill-conditioned, so among its many solutions this problem seeks one that is sparse.

$\min_x \|x\|_1 \quad\text{subject to}\quad Ax=b$
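Basis pursuit can be recast as a linear program by splitting x = u − v with u, v ≥ 0, so that $\|x\|_1 = \sum_i u_i + v_i$. A minimal sketch of this reduction, assuming SciPy's `linprog` is available (the problem data here is synthetic):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
M, N = 15, 30
A = rng.standard_normal((M, N))
x0 = np.zeros(N)
x0[[3, 7, 20]] = [1.5, -2.0, 0.7]      # a sparse ground truth
b = A @ x0

# split x = u - v with u, v >= 0, so ||x||_1 = sum(u) + sum(v)
c = np.ones(2 * N)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * N))
x = res.x[:N] - res.x[N:]
```

Any LP solver works here; the LP has 2N variables and M equality constraints.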

#### Synthesis variant

Instead of looking for a sparse solution x, we look for a sparse representation α in a basis or overcomplete dictionary Ψ. The problem is

$\min_\alpha \|\alpha\|_1 \quad\text{subject to}\quad A\Psi\alpha=b$

and then x is recovered via x = Ψα.

#### Analysis variant

Again, we expect x to have a sparse representation in a dictionary Ψ, but this time the problem uses the analysis operator $\Psi^\dagger$ (the pseudo-inverse of Ψ). If Ψ is a Parseval tight frame, then this is just the adjoint/transpose: $\Psi^\dagger = \Psi^*$:

$\min_x \|\Psi^\dagger x\|_1 \quad\text{subject to}\quad Ax=b$

This is equivalent to synthesis when Ψ is a basis (i.e. when its matrix form is invertible), but otherwise it is not equivalent.

### Basis Pursuit Denoising (BPDN)

This is a variant of BP that is used when measurements are inexact or noisy in some way:

$\min_x \|x\|_1 \quad\text{subject to}\quad \|Ax-b\|_2 \le \epsilon.$

There are synthesis and analysis variants of BPDN in the analogous fashion to BP.

### Lagrangian form of BPDN (LAG), aka the LASSO

This is an equivalent form to BPDN for the correct choice of λ = λ(ε). It is sometimes algorithmically easier to solve than BPDN, but the precise relation between λ and ε is in general unknown (unless one has already solved the problem).

$\min_x \|x\|_1 + \frac{\lambda}{2} \|Ax-b\|_2^2.$

Again, there are analysis and synthesis variants.

Since this form is somewhat easier to solve than BPDN, it is the most popular, and when people refer to "the LASSO", they usually mean LAG.
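The Lagrangian form is well suited to proximal-gradient methods. A minimal ISTA sketch for the objective above (function names are ours; the step size uses the spectral norm of A):

```python
import numpy as np

def ista_lag(A, b, lam, iters=500):
    # minimize ||x||_1 + (lam/2) ||Ax - b||_2^2 by proximal gradient (ISTA)
    L = lam * np.linalg.norm(A, 2) ** 2     # Lipschitz constant of the smooth part
    t = 1.0 / L                             # step size
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = lam * A.T @ (A @ x - b)         # gradient of the smooth term
        z = x - t * g
        # soft-threshold = prox of t * ||.||_1
        x = np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
    return x
```

With A = I this reduces to plain soft-thresholding of b at level 1/λ, which is a useful sanity check.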

### Original LASSO formulation of BPDN

The term "LASSO" is used differently by different people, so we avoid it on this wiki. The "original LASSO" or "Tibshirani's LASSO" (TibLASSO) refers to the form proposed by Tibshirani in his classic paper from the mid '90s. It is equivalent to BPDN and LAG for some choice of parameter τ, but this choice of parameter is in general unknown (unless you know a solution to the problem).

$\min_x \frac{1}{2}\|Ax-b\|_2^2 \quad\text{subject to}\quad \|x\|_1 \le \tau.$

### Elastic Net

The Elastic Net was proposed in a statistical framework to overcome limitations to the LASSO. Below, we will keep to our earlier terminology (variable $x \in \mathbb{R}^N$, matrix $A \in \mathbb{R}^{M \times N}$) and convert the standard statistics notation to our notation.

Some possible deficiencies of the LASSO in statistical regression:

• The LASSO selects a sparse solution in general, with at most M non-zeros. This can sometimes be limiting.
• If a group of variables is highly correlated pairwise, the LASSO often picks just one of them and sets the others to zero. For interpreting results, it would be more useful to select every member of the group.
• In the over-determined case, if A has highly correlated columns, the LASSO performs worse than standard ridge regression, according to Tibshirani's original '96 LASSO paper.

To overcome these perceived problems, Zou and Hastie [1] propose the Elastic Net. They start by defining the "naive" elastic net:

$x_\text{naive Elastic Net} = \operatorname{argmin}_x \|Ax-b\|_2^2 + \lambda_1 \|x\|_1 + \lambda_2 \|x\|_2^2.$

Assuming that one knows the values of λ1 and λ2, this problem can be solved with a standard LASSO solver, since the $\lambda_2\|x\|_2^2$ term can be absorbed into the $\|Ax-b\|_2^2$ term: augment the A matrix to become $[A;\, \sqrt{\lambda_2}\, I]$ (stacking $\sqrt{\lambda_2}\, I$ below A, where I is the $N \times N$ identity matrix), and similarly augment b to become [b; zeros(N,1)].
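The absorption trick rests on a simple algebraic identity, which can be checked numerically (the variable names below are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 8, 5
A = rng.standard_normal((M, N))
b = rng.standard_normal(M)
lam2 = 0.3

# stack sqrt(lam2) * I below A, and zeros below b
A_aug = np.vstack([A, np.sqrt(lam2) * np.eye(N)])
b_aug = np.concatenate([b, np.zeros(N)])

# for any x:  ||A_aug x - b_aug||^2 = ||Ax - b||^2 + lam2 ||x||^2
x = rng.standard_normal(N)
lhs = np.linalg.norm(A_aug @ x - b_aug) ** 2
rhs = np.linalg.norm(A @ x - b) ** 2 + lam2 * np.linalg.norm(x) ** 2
```

So running a LASSO solver on (A_aug, b_aug) with penalty λ1 solves the naive elastic net.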

If good values of λ1 and λ2 are not known, then other solution techniques that generate solutions for a range of λ may be preferred.

Zou and Hastie suggest that the naive estimator produces values that are too small, so the true Elastic Net is a scaled version of the naive estimator:

$x_\text{Elastic Net} = (1+\lambda_2) x_\text{naive Elastic Net}.\,$

### Fused Lasso

Before defining the Fused Lasso, consider the original Lasso (TibLASSO) in Tibshirani's original form:

$\min_x \frac{1}{2}\| Ax - b \|_2^2 \quad\text{such that}\quad \|x\|_1 \le \tau.$

where $A \in \mathbb{R}^{M \times N}$ and, as usual, $\|x\|_1 = \sum_{i=1}^N |x_i|$.

The Fused Lasso modifies this by adding a penalty on the differences of consecutive coefficients of x. The problem is:

$\min_x \frac{1}{2}\| Ax - b \|_2^2 \quad\text{such that}\quad \|x\|_1 \le \tau_1 \quad\text{and}\quad \sum_{i=2}^N |x_i-x_{i-1}| \le \tau_2.$

The effect is to not only encourage sparsity in the coefficients, but also sparsity in the difference of neighboring coefficients, and thus coefficients are more likely to clump together (either several coefficients in a row are zero, or they are nearly the same value).
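The two penalty terms are straightforward to evaluate; for a piecewise-constant vector, the second term charges only for the jumps (the helper name is ours):

```python
import numpy as np

def fused_penalties(x):
    # returns (||x||_1, sum_{i>=2} |x_i - x_{i-1}|)
    return float(np.abs(x).sum()), float(np.abs(np.diff(x)).sum())
```

For example, a vector with one flat "bump" pays for the bump's height twice in the difference term (once up, once down), regardless of the bump's width.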

The Fused Lasso was first proposed in [2].

### Dantzig selector (DS)

Proposed by Candès and Tao:

$\min_x \|x\|_1 \quad\text{subject to}\quad \|A^*(Ax-b)\|_\infty \le \delta.$

If there is no noise, then take δ = 0, in which case it is the same as BP whenever A has full row rank (which it usually does for practical problems). The different form of the constraint is motivated by the optimality conditions of the dual vector.

### Total Variation (TV) problems

These come in many variants (depending on the constraints), but they all share the total-variation norm, which can be seen as the $\ell_1$ norm of the discrete gradient magnitudes: at each pixel, the horizontal and vertical differences are treated as the two components of a complex number and the moduli are summed. For a TV problem we have a vector $x \in \mathbb{R}^n$, but it helps to think of it as a matrix (i.e. an image) $X \in \mathbb{R}^{n_1 \times n_2}$ (with $n = n_1 n_2$). Replace $\|x\|_1$ in the above problems with

$\|x\|_{TV} = \sum_{i=1}^{n_1-1} \sum_{j=1}^{n_2-1} \sqrt{ (X(i+1,j)-X(i,j))^2 + (X(i,j+1)-X(i,j))^2 }$

Sometimes this is referred to as the "isotropic TV norm", as opposed to the "anisotropic TV norm", defined as:

$\|x\|_{ATV} = \sum_{i=1}^{n_1-1} \sum_{j=1}^{n_2-1} |X(i+1,j)-X(i,j)| + |X(i,j+1)-X(i,j)|$

The isotropic TV norm is also sometimes regularized by adding a small nonzero offset under the square root, in order to make it differentiable everywhere. When we refer to just "TV", we mean the non-smoothed, isotropic TV norm, unless otherwise specified. The classic algorithm for solving TV-regularized denoising problems is Chambolle's algorithm, which computes the proximity operator of TV; no closed-form solution for the TV proximity operator is known.
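Both discrete TV norms defined above are direct to compute with array slicing (the function names are ours):

```python
import numpy as np

def tv_iso(X):
    # isotropic TV: sum of gradient magnitudes over interior differences
    dx = X[1:, :-1] - X[:-1, :-1]    # X(i+1,j) - X(i,j)
    dy = X[:-1, 1:] - X[:-1, :-1]    # X(i,j+1) - X(i,j)
    return float(np.sum(np.sqrt(dx**2 + dy**2)))

def tv_aniso(X):
    # anisotropic TV: sum of absolute differences
    dx = X[1:, :-1] - X[:-1, :-1]
    dy = X[:-1, 1:] - X[:-1, :-1]
    return float(np.sum(np.abs(dx) + np.abs(dy)))
```

Note that the anisotropic value is always at least the isotropic one, since $|a| + |b| \ge \sqrt{a^2 + b^2}$.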

There are other variants of TV that are sometimes easier to work with; for example, smoothing the norm in different fashions (corresponding to a different finite-difference stencil), or assuming periodic boundary conditions (which makes it amenable to analysis with the FFT).

The PPXA paper, by Pustelnik, Chaux and Pesquet, discusses some TV variants using filters such as the Roberts filter (a first-order finite-difference filter different from the one above), Prewitt filters, and Sobel filters. These filters allow one to decompose the TV function into a sum of a few auxiliary functions; each auxiliary function admits a closed-form proximity operator, although their sum still does not.

### Variants

These variations can apply to most of the problems above.

#### Weighted l1 norm

Weighted $\ell_1$ norm: replace $\|x\|_1$ with $\|Wx\|_1$ for some weight matrix $W \in \mathbb{R}^{p \times n}$. Almost all of the algorithms can handle (or be modified to handle) this case when W is square and non-singular, or especially when it is orthogonal, so we don't point out this case. But we do note when algorithms can handle this variant for arbitrary W, e.g. when $p \gg n$, as when W is the analysis operator of a frame.

#### Block l1 norm

Block norms, for the case of recovering multiple signals simultaneously: replace $\|x\|_1$ with $\|X\|_{q,1} = \sum_{i=1}^n \|X(i,:)\|_q$ where $X \in \mathbb{R}^{n \times d}$, using Matlab notation X(i,:) for the i-th row. Typically q = 2 or $q = \infty$. This setup is also known as "multiple measurement vectors (MMV)" and is used with fusion frames. The constraint norm becomes the matrix Frobenius norm instead of the $\ell_2$ norm. Most algorithms that work with $\ell_1$ will also work with the block norms (though some may require manual modification).
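In the notation above, $\|X\|_{q,1}$ just sums the $\ell_q$ norms of the rows; a one-line sketch (the function name is ours):

```python
import numpy as np

def block_norm(X, q=2):
    # ||X||_{q,1}: sum of the l_q norms of the rows of X
    return float(np.sum(np.linalg.norm(X, ord=q, axis=1)))
```

Minimizing this norm encourages entire rows of X to be zero, i.e. the signals share a common sparsity pattern.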

## Low-rank problems

### Matrix completion (MC)

Let $X \in \mathbb{R}^{n_1 \times n_2}$ be a matrix, and let $X_{i,j}$ denote its (i,j) entry. The problem of matrix completion is to estimate the matrix X given an incomplete set of entries. The set of observed entries is denoted Ω, so $(i,j) \in \Omega$ means that the entry $X_{i,j}$ is observed.

Recovering a matrix from incomplete entries is impossible unless there are further restrictions on the matrix. Usually the restriction is an assumption about the rank of the matrix. Ideally, one would solve this problem:

$\min_X \,\text{rank}(X) \quad\text{subject to}\quad X_{i,j} = Y_{i,j} \;\forall (i,j) \in \Omega$

where Y contains the partial observations. In general, this rank-minimization problem is not computationally feasible. When people refer to Matrix Completion, they sometimes mean any heuristic method for the rank-minimization problem, but the term can also refer to the following nuclear-norm minimization problem:

$\min_X \,\|X\|_* \quad\text{subject to}\quad X_{i,j} = Y_{i,j} \;\forall (i,j) \in \Omega$

The nuclear norm (aka trace norm aka Schatten norm with p = 1) is the sum of the singular values.

For convenience, let $\mathcal{A}_\Omega(X)$ be the sampling operator that picks out the entries in Ω, so that the constraint can be written $\mathcal{A}_\Omega(X) = \mathcal{A}_\Omega(Y).$
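The sampling operator $\mathcal{A}_\Omega$ is just a mask; a minimal sketch, representing Ω as a boolean array of the same shape as X (that representation is our choice):

```python
import numpy as np

def A_Omega(X, mask):
    # keep the observed entries (mask == True), zero out the rest
    return np.where(mask, X, 0.0)
```

With this helper, the equality constraint reads `A_Omega(X, mask) == A_Omega(Y, mask)` entrywise.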

Some algorithms can deal with general linear operators $\mathcal{A}$. If $\mathcal{A}$ satisfies a generalization of the RIP, then most compressed sensing results apply (see Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization by Recht, Fazel and Parrilo arXiv:0706.4138 ). The subsampling operator $\mathcal{A}_\Omega(X)$ does not satisfy RIP bounds, so proof techniques are quite different, and reconstruction is often more difficult as well.


#### Noisy matrix completion (NMC)

This is a slight variant of the nuclear-norm minimization problem that allows for noisy data. To account for noise, the equality constraints are relaxed:

$\min_X \,\|X\|_* \quad\text{subject to}\quad \|\mathcal{A}_\Omega(X) - \mathcal{A}_\Omega(Y)\|_F \le \epsilon$

It is common to use the Frobenius norm for the constraint since this is the maximum-likelihood term for i.i.d. Gaussian noise, though other choices are possible.

#### Lagrangian form of noisy matrix completion (LAG-MC)

Similar to noisy matrix completion, but with the constraints put into an objective term:

$\min_X \,\|X\|_* + \frac{\lambda}{2} \|\mathcal{A}_\Omega(X) - \mathcal{A}_\Omega(Y)\|_F^2$

For some choice of λ and ε, this is equivalent to NMC, but the exact correspondence between the parameters is generally not known until after one has solved either problem.
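Proximal-gradient methods for this objective need the proximity operator of the nuclear norm, which is singular-value soft-thresholding (SVT); a minimal sketch (the function name is ours):

```python
import numpy as np

def svt(Y, tau):
    # prox of tau * nuclear norm: soft-threshold the singular values of Y
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

SVT typically reduces the rank, since singular values at or below τ are set exactly to zero.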

### Robust PCA

The classic PCA technique finds a low-dimensional representation of a dataset, but it is sensitive to outliers in the data. There have been many variants of PCA that attempt to make it more robust to outliers, so there is no consensus on what the robust PCA (RPCA) technique means. However, in the field of matrix completion, RPCA usually refers to the following problem:

$\min_{S,L}\, \|L\|_* + \alpha \|S\|_1 \quad\text{subject to}\quad S+L = Y$

where $\|L\|_*$ is the nuclear norm of L and $\|S\|_1$ is the sum of the absolute values of the entries of S (i.e., the entrywise $\ell_1$ norm). The nuclear norm encourages L to have low rank, and the $\ell_1$ norm encourages S to have few nonzeros. The idea is that L captures the predictable, low-rank structure of the data, and S contains outliers.

For an idea of what RPCA can be used for, see the demo using TFOCS.

### Low-rank representation (LRR)

This is another generalization of PCA. We seek a low-rank representation of a data matrix, and not only allow outliers (like the RPCA described above) but also allow simple transformations. The problem is:

$\min_{S,L}\, \|L\|_* + \alpha \|S\|_{2,1} \quad\text{subject to}\quad S+AL = Y$

where A is some known transformation (if it is not known, it can be estimated in a separate step, and generally one would alternate between estimating A and solving the LRR problem, much as in dictionary learning). For background, see Robust Recovery of Subspace Structures by Low-Rank Representation by Liu, Lin, Yan, Sun, Yu and Ma. Yi Ma's group has worked on many similar variants, such as the ones in TILT: Transform Invariant Low-rank Textures (Zhang, Ganesh, Liang, Ma) and RASL: Robust Alignment by Sparse and Low-rank Decomposition for Linearly Correlated Images (Peng, Ganesh, Wright, Xu, Ma).

## References

1. Zou and Hastie, Regularization and variable selection via the elastic net, J. R. Statist. Soc. B 67 (2005), Part 2, pp. 301--320.
2. Tibshirani, Saunders, Rosset, Zhu and Knight, Sparsity and smoothness via the fused lasso, J. R. Statist. Soc. B 67 (2005), pp. 91--108.
