Incomplete Cholesky Decomposition: Sect. 4.7 in Kernel Methods for Machine Learning

A positive definite symmetric matrix $A \in \mathbb{R}^{N \times N}$ can be expressed as $RR^\top$ using a lower triangular matrix $R \in \mathbb{R}^{N \times N}$. This is known as the Cholesky decomposition of $A$. The process follows these steps:

Given the relationship
$$ A_{j,i} = \sum_{h=1}^N R_{j,h} R_{i,h}, $$ we initially set $R = O$ (a zero matrix). Assuming the values of $R$ are determined up to the $(i-1)$-th column:
1. Set $R_{i,i} = \sqrt{A_{i,i} - \sum_{h=1}^{i-1} R_{i,h}^2}$.
2. For $j = i+1, \ldots, N$, set $R_{j,i} = \frac{A_{j,i} - \sum_{h=1}^{i-1} R_{j,h} R_{i,h}}{R_{i,i}}$.
This process is repeated for $i = 1, \ldots, N$.
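
The following is a minimal sketch of these two steps in Python (NumPy); the function name `chol` and the final check against `np.linalg.cholesky` are our own choices, not part of the text.

```python
import numpy as np

def chol(A):
    """Return a lower triangular R with A = R @ R.T (A symmetric positive definite)."""
    N = A.shape[0]
    R = np.zeros((N, N))
    for i in range(N):
        # Step 1: the diagonal entry of the i-th column
        R[i, i] = np.sqrt(A[i, i] - np.sum(R[i, :i] ** 2))
        # Step 2: the entries below the diagonal in the i-th column
        for j in range(i + 1, N):
            R[j, i] = (A[j, i] - np.sum(R[j, :i] * R[i, :i])) / R[i, i]
    return R

A = np.array([[4.0, 2.0, 2.0],
              [2.0, 5.0, 1.0],
              [2.0, 1.0, 6.0]])
R = chol(A)
assert np.allclose(R @ R.T, A)
assert np.allclose(R, np.linalg.cholesky(A))  # the decomposition is unique
```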

To approximate $A$ as $A \approx \tilde{R} \tilde{R}^\top$ using a trapezoidal matrix $\tilde{R} \in \mathbb{R}^{N \times N}$ whose columns $r+1, \ldots, N$ are zero, we perform an incomplete Cholesky decomposition. However, simply stopping the above procedure at the $r$-th column is not the best we can do; by choosing the order of the columns (pivoting), we can improve the approximation accuracy.

First, we set $B := A$ and modify the above steps as follows. Given the relationship $$ B_{j,i} = \sum_{h=1}^N R_{j,h} R_{i,h}, $$ we initially set $R = O$. Assuming the values of $R$ are determined up to the $(i-1)$-th column, we identify the index $k$ with $i \leq k \leq N$ that maximizes the candidate value $B_{k,k} - \sum_{h=1}^{i-1} R_{k,h}^2$. We then swap the $i$-th and $k$-th rows and columns of $B$ and the $i$-th and $k$-th rows of $R$, so that after the swap we can set $R_{i,i} = \sqrt{B_{i,i} - \sum_{h=1}^{i-1} R_{i,h}^2}$, the square root of that maximal value. Subsequently, for $j = i+1, \ldots, N$, we set
$$ R_{j,i} = \frac{B_{j,i} - \sum_{h=1}^{i-1} R_{j,h} R_{i,h}}{R_{i,i}}. $$

We also define a permutation matrix $Q_i$ of size $N$ as the identity matrix with its entries changed to $(Q_i)_{i,k} = (Q_i)_{k,i} = 1$ and $(Q_i)_{i,i} = (Q_i)_{k,k} = 0$, where $k$ is the pivot index chosen at the $i$-th step. Proceeding in this manner up to the $r$-th column and setting $P := Q_1 \cdots Q_r$, we obtain $$ A = P B P^\top \approx P R R^\top P^\top. $$
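
A sketch of the pivoted procedure might look as follows, assuming we stop after a fixed number of columns $r$; the function name `icd` and the use of an index array `perm` in place of the matrices $Q_1, \ldots, Q_r$ are our own conventions.

```python
import numpy as np

def icd(A, r):
    """Pivoted incomplete Cholesky (sketch): returns R, nonzero only in its
    first r columns, and an index array perm with A[perm][:, perm] ~ R @ R.T."""
    B = A.astype(float).copy()
    N = B.shape[0]
    R = np.zeros((N, N))
    perm = np.arange(N)                   # records the accumulated swaps
    for i in range(r):
        # residual diagonal: d[k] = B[k,k] - sum_{h<i} R[k,h]^2
        d = np.diag(B) - np.sum(R[:, :i] ** 2, axis=1)
        k = i + int(np.argmax(d[i:]))     # pivot with the largest residual
        # swap the i-th and k-th rows and columns of B,
        # and the i-th and k-th rows of R (and of perm)
        B[[i, k], :] = B[[k, i], :]
        B[:, [i, k]] = B[:, [k, i]]
        R[[i, k], :] = R[[k, i], :]
        perm[[i, k]] = perm[[k, i]]
        R[i, i] = np.sqrt(d[k])           # the maximal residual becomes R[i,i]^2
        for j in range(i + 1, N):
            R[j, i] = (B[j, i] - np.sum(R[j, :i] * R[i, :i])) / R[i, i]
    return R, perm
```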

By this method, if we evaluate the error $B - RR^\top$ by its trace (which equals the trace of $A - PRR^\top P^\top$, since the trace is invariant under the permutation), it equals the sum $\sum_{i=r+1}^N R_{i,i}^2$ at the end of execution. Specifically, for $r+1 \leq i \leq N$, the $i$-th diagonal element of $RR^\top$ is $\sum_{j=1}^r R_{i,j}^2$, and its difference from $B_{i,i}$ is the residual $R_{i,i}^2 = B_{i,i} - \sum_{j=1}^r R_{i,j}^2$. Thus, the pivoting selects the indices $i$ in descending order of $R_{i,i}^2$. One may stop either at a predetermined column $r$, or once $R_{i,i}^2$ falls below a predetermined threshold.
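
Continuing the hypothetical `icd` sketch above, the trace identity can be checked numerically:

```python
A = np.array([[4.0, 2.0, 2.0],
              [2.0, 5.0, 1.0],
              [2.0, 1.0, 6.0]])
r = 2
R, perm = icd(A, r)
B = A[np.ix_(perm, perm)]                 # A with its rows/columns permuted
d = np.diag(B) - np.sum(R ** 2, axis=1)   # residual diagonal at termination
assert np.isclose(np.trace(B - R @ R.T), np.sum(d[r:]))
```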

Moreover, the above procedure ensures that the selected values $R_{i,i}^2$ decrease monotonically. Indeed, if the candidate value $B_{k,k} - \sum_{h=1}^{i-1} R_{k,h}^2$ of an index $k$ is not selected for the $i$-th column (being smaller than $R_{i,i}^2$), then for the $(i+1)$-th column its candidate value becomes $B_{k,k} - \sum_{h=1}^{i} R_{k,h}^2$, which is further reduced by $R_{k,i}^2 \geq 0$. Hence every candidate at the $(i+1)$-th column is at most $R_{i,i}^2$, and so is their maximum $R_{i+1,i+1}^2$.
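
In symbols, writing $d_k^{(i)} := B_{k,k} - \sum_{h=1}^{i-1} R_{k,h}^2$ for the candidate value of index $k$ at the $i$-th column (our own shorthand, ignoring the relabeling caused by the swaps), the argument reads
$$ R_{i+1,i+1}^2 = \max_{k \geq i+1} d_k^{(i+1)} = \max_{k \geq i+1} \left( d_k^{(i)} - R_{k,i}^2 \right) \leq \max_{k \geq i} d_k^{(i)} = R_{i,i}^2. $$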