sparse linear optimization

「稀疏一次最佳化」。一次最佳化，令最佳解是稀疏向量。

min ‖Ax - b‖₂²   subject to ‖x‖₀ ≤ k

以下將介紹基礎知識、相關問題。

sparsity

「稀疏度」。尚無正式中譯。向量當中，非零元素數量。

          ⎡ 2 ⎤
sparsity( ⎢ 0 ⎥ ) = 3  ,  ‖x‖₀ = 3
          ⎢ 1 ⎥
          ⎣ 9 ⎦

向量可以推廣成矩陣、張量。

rank

「秩」。矩陣當中，向量們以四則運算相消，非零向量數量。

      ⎡ 1 2 3 4 ⎤
rank( ⎢ 2 2 3 4 ⎥ ) = 2  ,  rank(A) = 2
      ⎢ 3 2 3 4 ⎥
      ⎣ 4 2 3 4 ⎦

矩陣可以推廣成張量。

vector norm

向量的長度函數。定義為元素的絕對值的p次方的總和的1/p次方。

‖x‖₀ = ∑|xᵢ|⁰    L⁰-norm (sparsity): count of non-zero elements
‖x‖₁ = ∑|xᵢ|¹    L¹-norm: sum of absolute value of elements
‖x‖₂ = √∑|xᵢ|²   L²-norm: squared root of squared sum

L⁰-norm其實不是norm。方便起見，大家依然稱呼為norm。

matrix norm

矩陣的長度函數，定義為奇異值們的長度函數。

rank(A) = ∑|σᵢ|⁰   rank: L⁰-norm of singular values
‖A‖⁎ = ∑|σᵢ|¹      nuclear norm: L¹-norm of singular values
‖A‖ꜰ = √∑|σᵢ|²     Frobenius norm: L²-norm of singular values

‖A‖⁎ = tr(sqrt(AᵀA))
‖A‖ꜰ = sqrt(tr(AᵀA)) = √∑|Aᵢⱼ|²

matrix approximation

sparse vector approximation
（best k-term approximation）

已知原向量y，求新向量x，指定稀疏度k。

 argmin ‖x - y‖₂²
‖x‖₀ ≤ k

x盡量不變：對應元素相減平方總和盡量小。

y的稀疏度可能先天短少：數學式子寫成≤ k而非= k。

正確答案很簡單：保留前k大的元素，其餘元素歸零。

low-rank matrix approximation
（best rank-k approximation）

已知原矩陣M，求新矩陣X，指定秩大小k。

  argmin ‖X - M‖ꜰ²
rank(X) ≤ k

X盡量不變：對應元素相減平方總和盡量小。

M的秩可能先天短少：數學式子寫成≤ k而非= k。

正確答案是奇異值分解：保留前k大的奇異值，其餘奇異值歸零。證明請見「Eckart–Young theorem」。

M = UΣVᵀ
X = UΣₖVᵀ

vector regularization

追加目標函數。有公式解：根據權重來篩選每個元素。

https://math.stackexchange.com/questions/2713427/
http://www.neulx.com/Soft-thresholding/

L⁰-norm regularization
min ‖x - y‖₂² + α ‖x‖₀     (α ≥ 0)

solution
x = H_√α(y)     (apply function H to each element of y)

hard thresholding operator
H_ε(x) = ⎰ x   if |x| > ε
       ⎱ 0   otherwise

L¹-norm regularization
min ‖x - y‖₂² + α ‖x‖₁     (α ≥ 0)

solution
x = S_½α(y)

soft thresholding operator
S_ε(x) = ⎧ x - ε   if x > ε   = sign(x) max(|x|-ε, 0)
       ⎨ x + ε   if x < -ε
       ⎩ 0       otherwise

L²-norm regularization
min ‖x - y‖₂² + α ‖x‖₂²     多了二次方，湊公式解

solution
x = T_α(y)

??? operator
T_ε(x) = x / (1 + ε)

matrix regularization

追加目標函數。有公式解：根據權重來篩選每個奇異值。

https://books.google.com.tw/books?id=I9H7CwAAQBAJ&pg=PA50

rank regularization
min ‖X - M‖ꜰ² + α rank(X)     (α ≥ 0)

solution
X = UH_√α(Σ)Vᵀ = 𝓗_√α(M) = U(Σ-αI)₊V

hard thresholding operator
H_ε(x) = ⎰ x   if |x| > ε
       ⎱ 0   otherwise

singular value hard thresholding operator
𝓗_ε(X) = UH_ε(Σ)Vᵀ

nuclear norm regularization
min ‖X - M‖ꜰ² + α ‖X‖⁎     (α ≥ 0)

solution
X = US_½α(Σ)Vᵀ = 𝓢_½α(M)

soft thresholding operator
S_ε(x) = ⎧ x - ε   if x > ε   = sign(x) max(|x|-ε, 0)
       ⎨ x + ε   if x < -ε
       ⎩ 0       otherwise

singular value soft thresholding operator
𝓢_ε(X) = US_ε(Σ)Vᵀ

Frobenius norm regularization
min ‖X - M‖ꜰ² + α ‖X‖ꜰ²     多了二次方，湊公式解

solution
X = UT_α(Σ)Vᵀ = 𝓣_α(M)

??? operator
T_ε(x) = x / (1 + ε)

singular value ??? operator
𝓣_ε(X) = UT_ε(Σ)Vᵀ

matrix recovery🚧

sparse vector recovery
（L⁰-norm minimization）

新舊向量相減x - y，推廣成一次函數Ax - b。

已知變換矩陣A、向量b，求原始向量x，指定稀疏度k。

 argmin ‖Ax - b‖₂²
‖x‖₀ ≤ k

請見後面章節vector norm minimization。

low-rank matrix recovery
（rank minimization）

新舊矩陣相減X - M，推廣成一次函數AX - B。

已知變換矩陣A、矩陣B，求原始矩陣X，指定秩大小k。

  argmin ‖AX - B‖ꜰ²
rank(X) ≤ k

請見後面章節matrix norm minimization。

【尚待確認：AX = B應該要改成A(X) = b？】

vector regularization

https://stats.stackexchange.com/questions/17781/
http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/AV0405/UMAR/AVassign2.pdf

sparse regularization: min ‖Ax - b‖₂² + α ‖x‖₀
 lasso regularization: min ‖Ax - b‖₂² + α ‖x‖₁
 ridge regularization: min ‖Ax - b‖₂² + α ‖x‖₂²   多了二次方，湊公式解

L⁰-norm是NP-hard問題，L¹-norm和L²-norm是P問題。

當A是RIP矩陣，則是P問題。【尚待確認】

L⁰-norm和L¹-norm沒有公式解，L²-norm有公式解。

當A是正規正交矩陣，則有公式解。

matrix regularization

          rank regularization: min ‖AX - B‖ꜰ² + α rank(X)
  nuclear norm regularization: min ‖AX - B‖ꜰ² + α ‖X‖⁎
Frobenius norm regularization: min ‖AX - B‖ꜰ² + α ‖X‖ꜰ²

走火入魔。

restricted nullspace property

null(A) = kernel(A) = {x : Ax = 0}
Ax = b always has unique solution iff null(A) = ∅

大一數學。

Σ(k) = {x : ‖x‖₀ = k}
Ax = b subject to ‖x‖₀ = k
always has unique solution iff null(A) ∩ Σ(2k) = ∅

兩個稀疏度為k的向量，加減之後，稀疏度可能是0到2k。

T(x) = {y : ‖x + y‖₁ ≤ ‖x‖₁}
min ‖x‖₁ subject to Ax = b
always has unique solution iff null(A) ∩ T(x) = ∅

T(x)：加上y之後可以湊出更小的答案。

C(S) = {x : ‖x_Sᶜ‖₁ ≤ ‖x_S‖₁}
min ‖x‖₁ subject to Ax = b and ‖x‖₀ ≤ k
always has unique solution iff null(A) ∩ C(S) = ∅ for all |S| ≤ k

S是k個座標軸。x_S只使用那k個座標，稀疏度頂多是k。x_Sᶜ只使用那k個以外的座標。

C(S)：其餘座標軸可以湊出更小的答案。

由於null(A)可以運用QR分解求得，因此此定理具備實際用處？【尚待確認】

restricted isometry property

restricted isometry property
(1-δ) ‖x‖² ≤ ‖Ax‖² ≤ (1+δ) ‖x‖²   when ‖x‖₀ = k

Johnson–Lindenstrauss lemma
(1-δ) ‖x-y‖² ≤ ‖f(x)-f(y)‖² ≤ (1+δ) ‖x-y‖²

RIP矩陣是多變量平緩函數Lipschitz function？【尚待確認】

(1-δ) ≤ eigenvalue(A'ᵀA') ≤ (1+δ)     A' is any subset of vectors of A

想當然爾，特徵值也在δ倍以內。

RIP是不是定義成奇異值比較好呢？【尚待確認】

‖A'ᵀA' - I‖₂ ≤ δ
https://ee227c.github.io/notes/ee227c-lecture22.pdf

移項之後，得到coherent。

不動點遞推法（梯度下降法）收斂性證明。啊就步伐越來越小。

RIP = RSC + RSM。平緩函數硬是看成凸函數。小可仔走味。

spark(A): minimal number of linear dependent columns of A
          (L⁰-norm minimization when b = 0)
rank(A): maximal number of linear independent columns of A
         (gaussian elimation)
http://wwwopt.mathematik.tu-darmstadt.de/spear/slides/Tillmann_Talk_2013_SPARS_Lausanne.pdf

RIP certification is NP-hard. Find smallest δ is NP-hard.

Is any polynomial-time algorithm for spark(A) when A satisfies RIP?【尚待確認】

RIP implies RNSP when δ_2k(A) ≤ 1/3
http://www.cs.cmu.edu/~pradeepr/716/notes/lec11.pdf

這啥玩意？

[Candès 2008]
let x* is the solution of min ‖x‖₀ subject to Ax = b
when ‖x*‖₀ = k
1. if δ_2k(A) ≤ 1
   then x* is the unique solution of min ‖x‖₀ subject to Ax = b
2. if δ_2k(A) ≤ sqrt(2) - 1
   then x* is the solution of min ‖x‖₁ subject to Ax = b
https://coral.ise.lehigh.edu/katyas/files/OPTML/Lecture17.pdf

當倍率δ足夠小，那麼L⁰-norm與L¹-norm的極值相等，而且是唯一解。

然而δ無法迅速求得，導致此定理沒有實際用處。

歷經那漫長地婉轉地靈巧地推導，仍未填滿心中那炙熱的疑惑，只徒留淡淡的哀傷。多少的算式已難追憶，不如歸去，不如歸去。

圖解

minimization：L¹-norm包含了L⁰-norm的其中一解？

minimization
min ‖x‖₁   subject to Ax = b

         ^  __-- Ax = b feasible region
        _|-'
    _--'/|\
 --'   //|\\  ‖x‖₁ ≤ k level set
---------+--------->
       \\|//
        \|/
         |

sparsification/approximation：當橢圓位置足夠高、橢圓傾斜角度足夠好，L¹-norm包含了L⁰-norm的其中一解？

sparsification
min ‖x‖₁   subject to ‖Ax - b‖₂² ≤ ε

         ^  
         | (  )  ‖Ax - b‖₂² ≤ ε feasible region
        /|\ 
       //|\\  ‖x‖₁ level set
---------+--------->
       \\|//
        \|/
         |

approximation
min ‖Ax - b‖₂²   subject to ‖x‖₁ ≤ k

         ^
         | ((o))  ‖Ax - b‖₂² level set
        /|\ 
       / | \  ‖x‖₁ ≤ k feasible region
---------+--------->
       \ | /
        \|/
         |

vector norm minimization

L⁰-norm minimization（sparse recovery）

一次方程組求解，解盡量稀疏。NP-hard問題。

聯立風格
⎰ solve Ax = b                   [underdetermined system]
⎱ minimize ‖x‖₀

約束最佳化風格
min ‖x‖₀   subject to Ax = b     [underdetermined system]

於是大家只好採用最佳化的套路，得到三種問題。

sparsification  min ‖x‖₀   subject to ‖Ax - b‖₂² ≤ ε
approximation   min ‖Ax - b‖₂²   subject to ‖x‖₀ ≤ k
regularization  min ‖Ax - b‖₂² + α ‖x‖₀

三種問題不等價。舉例來說，approximation無法徹底轉化成regularization。

approximation (λ is unknown argument)
min ‖Ax - b‖₂²   subject to ‖x‖₀ ≤ k
min ‖Ax - b‖₂²   subject to ‖x‖₀ - k ≤ 0
min ‖Ax - b‖₂² + λ(‖x‖₀ - k)
min ‖Ax - b‖₂² + λ ‖x‖₀

regularization (α is known parameter)
min ‖Ax - b‖₂² + α ‖x‖₀

雖然理論上不等價，但是實務上幾乎等價。馬馬虎虎啦。方便起見，大家習慣把這三個問題和原問題視作相同問題，其演算法互通。

演算法（iterative thresholding）

投影梯度法。朝著梯度方向走一步，然後立即投影到約束條件。

approximation
min ‖Ax - b‖²   subject to ‖x‖₀ ≤ k

projected gradient descent
⎰ x := x - step ⋅ 2Aᵀ (Ax - b)
⎱ x := P(x)       ^^^^^^^^^^^^ gradient

P is best k-term approximation

鄰近梯度法。朝著梯度方向走一步，然後追加函數、調整位置。

regularization
min ‖Ax - b‖₂² + α ‖x‖₀

proximal gradient descent
⎰ x := x - step ⋅ 2Aᵀ (Ax - b)
⎱ x := H_√α(x)

H is hard thresholding operator

兩種問題，不同算法。因為作者是同一人，所以名稱取同一個。

‖x‖₀不是凸函數，演算法不會收斂至最小值。雖快但不準。

演算法（alternating direction method of multipliers）

交替方向乘數法。約束條件事先併入目標函數。

minimization
min ‖x‖₀   subject to Ax = b

indicator
min f(x) + g(x)   where f(x) = ‖x‖₀ , g(x) = ⎰ 0   if Ax = b
                                             ⎱ ∞   otherwise
splitting
min f(x) + g(y)   subject to x - y = 0

lagrangian with penalty
min f(x) + g(y) + λ (x - y) + (ρ/2) ‖x - y‖₂²

scaled form
min f(x) + g(y) + (ρ/2) ‖x - y + μ‖₂² - (ρ/2) ‖μ‖₂²   where μ = λ/ρ

admm
⎧ x := argmin f(x) + (ρ/2) ‖x - y + μ‖₂²
⎨ y := argmin g(y) + (ρ/2) ‖x - y + μ‖₂²
⎩ μ := μ + step ⋅ [ρ (x - y + μ) - ρ μ]

admm
⎧ x := H_√2/ρ(y + μ)
⎨ y := (I - Aᵀ(AAᵀ)⁻¹A)(x + μ) + Aᵀ(AAᵀ)⁻¹b
⎩ μ := μ + x - y   let step ⋅ ρ = 1

H is hard thresholding operator

交替方向乘數法。

regularization
min ‖Ax - b‖₂² + α ‖x‖₀

regularization
min f(x) + g(x)   where f(x) = ‖Ax - b‖₂² , g(x) = α ‖x‖₀

splitting
min f(x) + g(y)   subject to x - y = 0

lagrangian with penalty
min f(x) + g(y) + λ (x - y) + (ρ/2) ‖x - y‖₂²

scaled form
min f(x) + g(y) + (ρ/2) ‖x - y + μ‖₂² - (ρ/2) ‖μ‖₂²   where μ = λ/ρ

admm
⎧ x := argmin ‖Ax - b‖₂² + (ρ/2) ‖x - y + μ‖₂²
⎨ y := argmin α ‖y‖₀ + (ρ/2) ‖x - y + μ‖₂²
⎩ μ := μ + step ⋅ [ρ (x - y + μ) - ρ μ]

admm
⎧ x := (AᵀA + ρI)⁻¹ (Aᵀb + ρ(y - μ))
⎨ y := H_√2α/ρ(x + μ)
⎩ μ := μ + x - y   let step ⋅ ρ = 1

H is hard thresholding operator

兩種問題，同樣算法，不同過程。

‖x‖₀不是凸函數，演算法不會收斂至最小值。雖快但不準。

演算法（proximal ADMM）

每道限制條件，追加拉格朗日乘數，併成向量λ。

照理來說，x必須一路走到最小值，但是此處只走一步。我不確定這行不行的通。

minimization
min ‖x‖₀   subject to Ax = b

lagrangian with penalty
L(x, λ) = ‖x‖₀ + λᵀ (Ax - b) + (ρ/2) ‖Ax - b‖₂²

proximal gradient method & admm
⎧ x := Aᵀλ
⎨ x := H₁(x)
⎩ λ := λ + step ⋅ (Ax - b)

H is hard thresholding operator

‖x‖₀不是凸函數，演算法不會收斂至最小值。雖快但不準。

演算法（orthogonal matching pursuit）

類似於最小餘數法。找到跟餘數最接近的向量，作為變換矩陣。重複k回合。

用來快速求得近似解。

一、初始化。
　甲、r := b
　乙、A' := null
二、重複以下步驟k回合。
　甲、從A挖掉一個向量放入A'，該向量跟r的點積（相關度）盡量大。
　乙、r := b - A' x = b - A' A'⁺ b

演算法（compressive sampling matching pursuit）

改良版。

一口氣取2k個向量當作A'，該向量們跟r的點積（相關度）是前2k名。
算好 A'⁺ b 之時，再篩出k個當作x。

當A是RIP矩陣，此演算法可以得到最佳解嗎？【尚待確認】

L¹-norm minimization（basis pursuit）

走火入魔。然而是P問題。

聯立風格
⎰ solve Ax = b                   [underdetermined system]
⎱ minimize ‖x‖₁

約束最佳化風格
min ‖x‖₁   subject to Ax = b     [underdetermined system]

仍然可以採用最佳化的套路，得到三種問題。

sparsification  min ‖x‖₁   subject to ‖Ax - b‖₂² ≤ ε   (Basis Pursuit Denoising)
approximation   min ‖Ax - b‖₂²   subject to ‖x‖₁ ≤ k   (Lasso)
regularization  min ‖Ax - b‖₂² + α ‖x‖₁

convex relaxation

L⁰-norm改成L¹-norm。

提升一個次方，變成凸函數，方便最佳化。

演算法（iterative thresholding）

鄰近梯度法。概念同前。

regularization
min ‖Ax - b‖₂² + α ‖x‖₁

proximal gradient descent
⎰ x := x - step ⋅ 2Aᵀ (Ax - b)
⎱ x := S_½α(x)

S is soft thresholding operator

‖x‖₁是凸函數，演算法會收斂至最小值。

演算法（alternating direction method of multipliers）

交替方向乘數法。概念同前。

minimization
min ‖x‖₁   subject to Ax = b

indicator
min f(x) + g(x)   where f(x) = ‖x‖₁ , g(x) = ⎰ 0   if Ax = b
                                             ⎱ ∞   otherwise
lagrangian scaled form
min f(x) + g(y) + (ρ/2) ‖x - y + μ‖₂² - (ρ/2) ‖μ‖₂²   where μ = λ/ρ

admm
⎧ x := S_α/ρ(y + μ)
⎨ y := (I - Aᵀ(AAᵀ)⁻¹A)(x + μ) + Aᵀ(AAᵀ)⁻¹b
⎩ μ := μ + x - y   let step ⋅ ρ = 1

S is soft thresholding operator

‖x‖₁是凸函數，演算法會收斂至最小值。

演算法（linear programming）

‖x‖₁從0切開是兩條一次函數，得以使用一次規劃。同時考慮x的正負；最小值所在之處，正負必定相消。

minimization
min ‖x‖₁   subject to Ax = b

linear programming
min cᵀz   subject to [A -A] z = b     where c = [1 1 1 ...]ᵀ
z≥0

Lᵖ-norm minimization

推廣成任意數。但是沒有實際用處。

[0,1)是NP-hard問題，[1,∞)是P問題。1是分水嶺。

演算法（iterative thresholding）

regularization
min ‖Ax - b‖₂² + α ‖x‖_p

proximal gradient descent
⎰ x := x - step ⋅ 2Aᵀ (Ax - b)
⎱ x := S_α,p(x)

S is shrinkage thresholding operator

演算法（iteratively reweighted least squares）

https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares
http://bigwww.epfl.ch/guerquin/documents/IRLS.pdf

不斷以L²-norm解答近似Lᵖ-norm解答。

一、每筆誤差各自擁有權重。最初為1。
二、重複下述步驟，直到收斂：
　甲、問題改成加權版本、改成L²-norm。求解，求誤差。
　乙、權重除以誤差平方，以降低誤差影響力。
　　　（當p=2，調整權重之後，誤差為1。）

matrix norm minimization

rank minimization（low-rank recovery）

一次方程組求解，解盡量低秩。NP-hard。

聯立風格
⎰ solve AX = B                      [underdetermined system]
⎱ minimize rank(X)

約束最佳化風格
min rank(X)   subject to AX = B     [underdetermined system]

於是大家只好採用最佳化的套路，得到三種問題。

sparsification  min rank(X)   subject to ‖AX - B‖ꜰ² ≤ ε
approximation   min ‖AX - B‖ꜰ²   subject to rank(X) ≤ k
regularization  min ‖AX - B‖ꜰ² + α rank(X)

演算法（iterative thresholding）

鄰近梯度法。概念同前。

regularization
min ‖AX - B‖ꜰ² + α rank(X)

proximal gradient descent
⎰ X := X - step ⋅ Aᵀ(AX - B)
⎱ X := 𝓗_√α(X)

𝓗 is singular value hard thresholding operator

演算法（alternating minimization）

改良版。避免每回合都要計算奇異值分解。

事先計算B的奇異值分解B = UΣVᵀ。事先把X拆成兩個矩陣相乘X = LRᵀ，用來取代SVD。

初始值設定成L = U√Σ、Rᵀ = √ΣVᵀ，只保留前k大的奇異值暨奇異向量。尺寸設定成L: m×k、Rᵀ: k×n，使得秩不超過k。

分塊座標下降法。L與R視作兩個維度，輪流行進。

一、alternating descent
　　梯度下降法。往梯度方向走一步。
二、alternating least squares
　　最陡下降法。一次微分等於零的地方，推導公式解，一步走到最低處。

approximation
min ‖AX - B‖ꜰ²   subject to rank(X) ≤ k

factorization
min ‖ALRᵀ - B‖ꜰ²   where X: m×n, L: m×k, Rᵀ: k×n

alternating descent
⎰ L := L - step ⋅ d/dL ‖ALRᵀ - B‖ꜰ²
⎱ R := R - step ⋅ d/dR ‖ALRᵀ - B‖ꜰ²

alternating least squares
⎰ L := argmin ‖ALRᵀ - B‖ꜰ²
⎱ R := argmin ‖ALRᵀ - B‖ꜰ²

演算法（alternating direction method of multipliers）

nuclear norm minimization

走火入魔。然而是P問題？【尚待確認】

聯立風格
⎰ solve AX = B                   [underdetermined system]
⎱ minimize ‖X‖⁎

約束最佳化風格
min ‖X‖⁎   subject to AX = B     [underdetermined system]

仍然可以採用最佳化的套路，得到三種問題。

sparsification  min ‖X‖⁎   subject to ‖AX - B‖ꜰ² ≤ ε
approximation   min ‖AX - B‖ꜰ²   subject to ‖X‖⁎ ≤ k
regularization  min ‖AX - B‖ꜰ² + α ‖X‖⁎

convex relaxation

rank改成nuclear norm。

提升一個次方，變成凸函數，方便最佳化。

演算法（iterative thresholding）

鄰近梯度法。概念同前。

regularization
min ‖AX - B‖ꜰ² + α ‖X‖⁎

proximal gradient descent
⎰ X := X - step ⋅ Aᵀ(AX - B)
⎱ X := 𝓢_½α(X)

𝓢 is singular value soft thresholding operator

Hölder's inequality

最小化nuclear norm化作最小化Frobenius norm。

‖XY‖⁎ ≤ ‖X‖ꜰ ‖Y‖ꜰ ≤ (‖X‖ꜰ² + ‖Y‖ꜰ²) / 2

演算法（alternating minimization）

分塊座標下降法。L與R視作兩個維度，輪流行進。

regularization
min ‖AX - B‖ꜰ² + α ‖X‖⁎

factorizarion
min ‖A(LRᵀ) - B‖ꜰ² + α ‖LRᵀ‖⁎

maximum margin matrix factorization
min ‖A(LRᵀ) - B‖ꜰ² + α/2 ‖L‖ꜰ² + α/2 ‖R‖ꜰ²

block coordinate descent
⎰ L := L - step ⋅ d/dL { ‖A(LRᵀ) - B‖ꜰ² + α/2 ‖L‖ꜰ² + α/2 ‖R‖ꜰ² }
⎱ R := R - step ⋅ d/dR { ‖A(LRᵀ) - B‖ꜰ² + α/2 ‖L‖ꜰ² + α/2 ‖R‖ꜰ² }

matrix splitting

robust principal component analysis

聯立風格
⎧ solve M = L + S
⎨ minimize rank(L)
⎩ minimize ‖S‖₀

多目標最佳化、約束最佳化風格
min rank(L) + α ‖S‖₀   subject to M = L + S     (α ≥ 0)

rank：線性獨立的向量數量。我們希望用最少的向量集合，表達整個矩陣M。

sparsity：零零碎碎的錯誤數量。我們希望估計錯誤的元素越少越好。

low-rank L：常見的、司空見慣的地方。數據不是其他數據的線性組合。宛如基因演算法，數據之間的組合，總是那幾套模式。

sparse S：罕見的、獨樹一格的地方。數據的某些維度，只有自己有值，而別人無值。

為何可以這樣分解呢？目前沒有數學系統可以完整詮釋。

此分解相當具有特色。可能是稀疏版本的二階泰勒展開。

robust principal component analysis名稱由來

PCA改寫成最佳化問題，穿鑿附會得到robust PCA。

我是覺得改個名字比較好啦。

min ‖X - AᵣAᵣᵀX‖ꜰ²   subject to AᵣᵀAᵣ = I         PCA
 W
min ‖X - L‖ꜰ²   subject to rank(L) ≤ k            AᵣAᵣᵀX咻一下變L
 L
min ‖X - L‖₀   subject to rank(L) ≤ k             L²-norm咻一下變L⁰-norm
 L
min ‖S‖₀   subject to rank(L) ≤ k and X = L + S   X - L改寫成S
L,S
min rank(L) + α ‖S‖₀   subject to X = L + S       robust PCA
L,S

convex relaxation

提升一個次方，變成凸函數，方便最佳化。

     min rank(L) + α ‖S‖₀
---> min ‖L‖⁎ + α ‖S‖₁        (principal component pursuit)

演算法（iterative thresholding）

分塊座標下降法。L與S視作兩個維度，輪流走到最低處。

regularization
min ‖L‖⁎ + α ‖S‖₁ + (ρ/2) ‖M - L - S‖ꜰ²

alternating proximal gradient method
⎰ L := 𝓢_1/ρ(M - S)
⎱ S := P_α/ρ(M - L)

𝓢 is singular value soft thresholding operator
P is soft thresholding operator

演算法（iterative thresholding）

分塊座標下降法。低秩矩陣L拆成兩個矩陣相乘L = XYᵀ。

我看不懂最佳化式子怎麼來的，有點奇怪。【尚待確認】

regularization
min ‖L‖⁎ + α ‖S‖₁ + (ρ/2) ‖M - L - S‖ꜰ²

factorization
 min ‖M - XYᵀ - S‖ꜰ² + 1/4 ‖XᵀX - YᵀY‖ꜰ²
X,Y,S 

alternating proximal gradient method
⎧ S := H(M - XYᵀ)
⎨ X := X - step_X ⋅ ∇_XF(X, Y, S)
⎩ Y := Y - step_Y ⋅ ∇_YF(X, Y, S)

H is hard thesholding operator

演算法（alternating direction method of multipliers）

交替方向乘數法。每個已知元素，追加拉格朗日乘數，併成矩陣Λ。我不清楚是否實用。

minimization
min ‖L‖⁎ + α ‖S‖₁   subject to M = L + S

lagrangian with penalty
L(L, S, Λ) = min ‖L‖⁎ + α ‖S‖₁ + dot(Λ, M-L-S) + (ρ/2) ‖M-L-S‖ꜰ²

admm
⎧ L := 𝓢_1/ρ(M - S + ρ⁻¹Λ)
⎨ S := P_α/ρ(M - L + ρ⁻¹Λ)
⎩ Λ := Λ + ρ ⋅ (M - L - S)

𝓢 is singular value soft thresholding operator
P is soft thresholding operator

matrix factorization

一個矩陣分解成兩個矩陣相乘。P問題。

solve M = XY

答案無限多種，例如QR分解、LU分解、特徵分解。

例如奇異值分解，從中間切一半：X = U√Σ、Y = √ΣVᵀ。

M = UΣVᵀ = (U√Σ)(√ΣVᵀ)

如果不喜歡中規中矩的答案，那麼可以化作最佳化問題，追加目標函數、約束條件，求得其他更對味的答案。

minimize ‖XY - M‖ꜰ²

regularized matrix factorization

矩陣分解，追加目標函數。

          rank regularization: min ‖XY - M‖ꜰ² + α rank(X) + β rank(Y)
  nuclear norm regularization: min ‖XY - M‖ꜰ² + α ‖X‖⁎ + β ‖Y‖⁎
Frobenius norm regularization: min ‖XY - M‖ꜰ² + α ‖X‖ꜰ² + β ‖Y‖ꜰ²

Frobenius norm比較常見。

Frobenius norm regularization
min ‖XY - M‖ꜰ² + α ‖X‖ꜰ² + β ‖Y‖ꜰ²

alternating least squares
⎰ X := MYᵀ(YᵀY + βI)⁻¹
⎱ Y := (XᵀX + αI)⁻¹XᵀM

若權重相同，運用Hölder's inequality將Frobenius norm化作nuclear norm，得到公式解。

http://jmlr.org/papers/volume16/hastie15a/hastie15a.pdf

Frobenius norm regularization (when α = β)
min ‖XY - M‖ꜰ² + α ‖X‖ꜰ² + α ‖Y‖ꜰ²

nuclear norm regularization
min ‖XY - M‖ꜰ² + 2α ‖XY‖⁎

solution
⎰ X = U√𝓢_α(Σ)
⎱ Yᵀ = √𝓢_α(Σ)Vᵀ

甚至有人將向量乘上不同權重。請見《Collaborative Filtering for Implicit Feedback Datasets》。

https://math.stackexchange.com/questions/1072451/

nonnegative matrix factorization

矩陣元素為正數或零，不可為負數。NP-hard問題。

min ‖XY - M‖ꜰ²   subject to X ≥ 0, Y ≥ 0

真實世界的數據通常是非負數，因而追加非負限制。

有時候，非負導致稀疏。最佳解位於負數，受到非負限制，最佳解恰好移動到零。

演算法（alternating minimization）

分塊座標下降法。L與R視作兩個維度，輪流走到最低處。隨時投影到非負數。

block coordinate descent
⎰ X := argmin ‖XY - M‖ꜰ²   subject to X ≥ 0
⎱ Y := argmin ‖XY - M‖ꜰ²   subject to Y ≥ 0

block coordinate descent & projected gradient method
⎰ X := max(0, MY⁺) = max(0, MY(YᵀY)⁻¹)
⎱ Y := max(0, X⁺M) = max(0, (XᵀX)⁻¹XM)

方法一：梯度下降法，max用來投影到限制條件：非負數。
方法二：最陡下降法，W和H視作兩個維度。
方法三：修改一下步伐大小、有的沒的，催出速度。

sparse matrix approximiation

指定稀疏度。P問題。

minimize ‖XY - M‖ꜰ²   subject to ‖XY‖₀ ≤ k

答案：保留前k大的元素，其餘元素歸零，然後矩陣分解。

low-rank matrix approximiation

指定秩。P問題。

minimize ‖XY - M‖ꜰ²   subject to rank(XY) ≤ k

答案：保留前k大的奇異值，其餘奇異值歸零，然後切一半。

sparse matrix factorization

矩陣分解，某些矩陣盡量稀疏。NP-hard問題。

聯立風格
⎧ solve M = XY
⎨ minimize ‖X‖₀
⎩ minimize ‖Y‖₀

多目標最佳化、約束最佳化風格
min ‖X‖₀ + ‖Y‖₀   subject to M = XY

於是大家只好採用最佳化的套路，得到三種問題。

sparsification  min ‖X‖₀ + ‖Y‖₀   subject to  ‖XY - M‖ꜰ² ≤ ε
approximation   min ‖XY - M‖ꜰ²   subject to ‖X‖₀ ≤ k₁, ‖Y‖₀ ≤ k₂
regularization  min ‖XY - M‖ꜰ² + α ‖X‖₀ + β ‖Y‖₀

稀疏度，可以只處理第一個矩陣、只處理第二個矩陣、兩者皆處理（我沒見過）。

approximation
min ‖XY - M‖²   subject to ‖X‖₀ ≤ k     (sparse dictionary learning)
min ‖XY - M‖²   subject to ‖Y‖₀ ≤ k     (sparse coding) (compressed sensing)
min ‖XY - M‖²   subject to ‖X‖₀ ≤ k₁, ‖Y‖₀ ≤ k₂      (???)

甚至是矩陣的每個向量。

approximation
min ‖XY - M‖²   subject to ‖Y_col(i)‖₀ ≤ kᵢ for all i

引入線性函數的觀念，將矩陣切成直條。X視作座標軸，Y視作許多組座標值。我們可以指定座標軸稀疏度、或者指定座標值稀疏度、或者兩者皆指定（我沒見過）。

引入編碼學的觀念。X視作字典，Y視作編碼。我們希望讓字典和編碼盡量短，抓到重點，節省儲存空間。

                  basis       weights
    data        dictionary    codings
⎡ |  |  |  ⎤   ⎡ |  |  |  ⎤ ⎡ |  |  |  ⎤
⎢ v₁ v₂ v₃ ⎥ = ⎢ d₁ d₂ d₃ ⎥ ⎢ c₁ c₂ c₃ ⎥
⎣ |  |  |  ⎦   ⎣ |  |  |  ⎦ ⎣ |  |  |  ⎦
     M              X            Y

順帶一提，當Y稀疏度指定為一，恰是k-means clustering。

演算法（k-SVD）

分塊座標下降法。輪流更新X和Y。

approximation
min ‖XY - M‖ꜰ²   subject to ‖Y_col(i)‖₀ ≤ kᵢ for all i

block coordinate descent
⎰ Y := orthogonal matching pursuit
⎱ X := singular value decomposition

初始化字典X：隨機矩陣。令X每個向量長度均為一。

已知字典X，求編碼Y：依序求得每個Y直條。L⁰-norm minimization，演算法採用orthogonal matching pursuit。

已知編碼Y，求字典X：依序求得每個X直條。過程如下：

  min ‖XY - M‖ꜰ²
= min ‖∑_i{X_col(i) Y_row(i)} - M‖ꜰ²
= min ‖X_col(k) Y_row(k) + ∑_i≠k{X_col(i) Y_row(i)} - M‖ꜰ²
= min ‖X_col(k) Y_row(k) - E‖ꜰ²
= min ‖X_col(k) Y_row(k) Ω - EΩ‖ꜰ²

一、X拆成K個直條，Y拆成K個橫條。
　　一個X直條乘以一個Y橫條，得到一個矩陣。
　　XY拆成K個矩陣相加。
二、嘗試求得第k個X直條，X_col(k)。
　　將其他K-1個矩陣併入M，得到E。
三、為了維持Y的稀疏度，
　　Y_row(k)刪除元素為零的直條，得到Y_row(k)Ω。
　　E也刪除對應直條，得到EΩ。
四、EΩ的最大的奇異值的奇異向量（左邊矩陣U），當作新的X_col(k)。

演算法（minimum residual method）【尚無正式名稱】

https://arxiv.org/pdf/1311.3315

最小餘數法。每回合求得一個矩陣：外積四捨五入開根號。

minimization
⎰ solve M = X₁ X₂ X₃ ...
⎱ minimize ‖X₁‖₀ + ‖X₂‖₀ + ‖X₃‖₀ + ...

minimum residual method
let M = XY   where X = X₁, Y = X₂ X₃ ...
⎧ XXᵀ := round(MMᵀ)               // assume MMᵀ ≈ XXᵀ
⎨ X := E√Λ   where XXᵀ = EΛEᵀ     // get X from XXᵀ
⎩ Y := X⁻¹M                       // remove X from M

sparse nonnegative matrix factorization

非負矩陣分解，某些矩陣盡量稀疏。

演算法（alternating minimization）

https://arxiv.org/pdf/cs/0202009

regularization
min 1/2 ‖XY - M‖ꜰ² + α ‖Y‖₁

alternating descent
⎧ X := X - step ⋅ (XY - M)Yᵀ       // gradient descent
⎨ X := max(0, X)                   // projection
⎪ Xᵢⱼ := Xᵢⱼ / ∑ᵢ Xᵢⱼ              // column normalization
⎩ Y := Y ⊙ (XᵀM) ⊘ (XᵀXY + α)     // element-wise product & division

low-rank matrix factorization

照理來說問題應該是：某些矩陣盡量低秩。

聯立風格
⎧ solve M = X₁ X₂ ...
⎨ minimize rank(X₁)
⎪ minimize rank(X₂)
⎩        :

多目標最佳化、約束最佳化風格
min rank(X₁) + rank(X₂) + ...   subject to M = X₁ X₂ ...

網民卻把問題改成萌萌噠：相乘之後的秩盡量小。

⎰ solve M = XY
⎱ minimize rank(XY)

根本就是low-rank matrix approximation。嘔嘔嘔嘔嘔。