
Incomplete Cholesky Decomposition: Sect. 4.7 in Kernel Methods for Machine Learning

A positive definite symmetric matrix $A \in \mathbb{R}^{N \times N}$ can be expressed as $RR^\top$ using a lower triangular matrix $R \in \mathbb{R}^{N \times N}$. This is known as the Cholesky decomposition of $A$. The process follows these steps:

Given the relationship
$$ A_{j,i} = \sum_{h=1}^N R_{j,h} R_{i,h}, $$ we initially set $R = O$ (a zero matrix). Assuming the values of $R$ are determined up to the $(i-1)$-th column:
1. Set $R_{i,i} = \sqrt{A_{i,i} - \sum_{h=1}^{i-1} R_{i,h}^2}$.
2. For $j = i+1, \ldots, N$, set $R_{j,i} = \frac{A_{j,i} - \sum_{h=1}^{i-1} R_{j,h} R_{i,h}}{R_{i,i}}$.
This process is repeated for $i = 1, \ldots, N$.
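As an illustration, here is a minimal sketch of these steps in Python (assuming NumPy; the function name cholesky_lower is ours, not taken from the book):

```python
import numpy as np

def cholesky_lower(A):
    """Cholesky decomposition A = R R^T with R lower triangular,
    following the column-by-column steps above."""
    N = A.shape[0]
    R = np.zeros((N, N))
    for i in range(N):
        # step 1: the diagonal element of the i-th column
        R[i, i] = np.sqrt(A[i, i] - np.sum(R[i, :i] ** 2))
        # step 2: the elements below the diagonal in the i-th column
        for j in range(i + 1, N):
            R[j, i] = (A[j, i] - np.sum(R[j, :i] * R[i, :i])) / R[i, i]
    return R

# quick check on a random positive definite matrix
Z = np.random.randn(5, 5)
A = Z @ Z.T + 5 * np.eye(5)
R = cholesky_lower(A)
print(np.allclose(R @ R.T, A))  # True
```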

To approximate $A$ as $A \approx \tilde{R} \tilde{R}^\top$ with a trapezoidal matrix $\tilde{R}$ whose columns from the $(r+1)$-th onward are zero, we perform an incomplete Cholesky decomposition. However, simply stopping the procedure above at the $r$-th column is naive; we would like to improve the approximation accuracy.

First, we start with $B := A$ and modify the above steps as follows. Given the relationship $$ B_{j,i} = \sum_{h=1}^N R_{j,h} R_{i,h}, $$ we initially set $R = O$. Assuming the values of $R$ are determined up to the $(i-1)$-th column, we identify the $k$ ($i \leq k \leq N$) that maximizes $R_{k,k} = \sqrt{B_{k,k} - \sum_{h=1}^{i-1} R_{k,h}^2}$. We then swap the $i$-th and $k$-th rows and columns of $B$ and the $i$-th and $k$-th rows of $R$, so that the maximizing value becomes $R_{i,i}$. Subsequently, for $j = i+1, \ldots, N$, we set
$$ R_{j,i} = \frac{B_{j,i} - \sum_{h=1}^{i-1} R_{j,h} R_{i,h}}{R_{i,i}}. $$

We also define $Q_i$ as the $N \times N$ identity matrix modified so that $Q_i(i,k) = Q_i(k,i) = 1$ and $Q_i(i,i) = Q_i(k,k) = 0$, where $k$ is the index chosen at the $i$-th step. Proceeding in this manner up to the $r$-th column and setting $P := Q_1 \cdots Q_r$, we obtain
$$ A = PBP^\top \approx PRR^\top P^\top. $$
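The pivoted procedure can be sketched in Python as follows (again assuming NumPy; incomplete_cholesky and the returned pivot array are our own conventions, not the book's code). Instead of forming the matrices $Q_i$ explicitly, the array pivot records the accumulated permutation:

```python
import numpy as np

def incomplete_cholesky(A, r):
    """Incomplete Cholesky decomposition with pivoting.
    Returns (R, pivot) such that A[np.ix_(pivot, pivot)] is approximated
    by R R^T, where R has nonzero entries only in its first r columns."""
    N = A.shape[0]
    B = A.astype(float).copy()
    R = np.zeros((N, N))
    pivot = np.arange(N)                 # records the permutation P
    for i in range(r):
        # residual diagonals B[k,k] - sum_{h<i} R[k,h]^2 for k = i, ..., N-1
        d = np.diag(B)[i:] - np.sum(R[i:, :i] ** 2, axis=1)
        k = i + int(np.argmax(d))        # index of the largest residual diagonal
        # swap the i-th and k-th rows/columns of B, the rows of R, and the record
        B[[i, k], :] = B[[k, i], :]
        B[:, [i, k]] = B[:, [k, i]]
        R[[i, k], :] = R[[k, i], :]
        pivot[[i, k]] = pivot[[k, i]]
        # the maximizing value becomes R[i,i]
        R[i, i] = np.sqrt(B[i, i] - np.sum(R[i, :i] ** 2))
        for j in range(i + 1, N):
            R[j, i] = (B[j, i] - np.sum(R[j, :i] * R[i, :i])) / R[i, i]
    return R, pivot
```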

By this method, the error $B - RR^\top$ (equivalently $A - PRR^\top P^\top$, since the trace is invariant under the permutation), evaluated by its trace, equals the sum $\sum_{i=r+1}^N R_{i,i}^2$ at the end of execution. Specifically, for $r+1 \leq i \leq N$, the $i$-th diagonal element of $RR^\top$ is $\sum_{j=1}^r R_{i,j}^2$, and its difference from $B_{i,i}$ is $R_{i,i}^2$. Thus, we select $i$ in descending order of $R_{i,i}^2$. We may either stop at a predetermined column $r$ or stop once $R_{i,i}^2$ falls below a predetermined threshold.

Moreover, the above procedure ensures that the selected values $R_{i,i}^2$ decrease monotonically in $i$. Indeed, if $R_{k,k}^2 = B_{k,k} - \sum_{h=1}^{i-1} R_{k,h}^2$ is not selected for the $i$-th column (being smaller than $R_{i,i}^2$), then for the $(i+1)$-th column it becomes $R_{k,k}^2 = B_{k,k} - \sum_{h=1}^{i} R_{k,h}^2$, which is further reduced by $R_{k,i}^2 \geq 0$; hence the value selected at the $(i+1)$-th step cannot exceed $R_{i,i}^2$.
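Continuing the sketch above (and reusing incomplete_cholesky from it), both the trace identity and the monotone decrease of the pivots can be checked numerically:

```python
import numpy as np
# reuses incomplete_cholesky from the sketch above

X = np.random.randn(100, 10)
A = X @ X.T                          # a (generically) rank-10 Gram matrix
r = 5
R, pivot = incomplete_cholesky(A, r)
B = A[np.ix_(pivot, pivot)]          # the pivoted matrix

# trace of the residual equals the sum of the remaining diagonal values
residual = B - R @ R.T
remaining = np.diag(B)[r:] - np.sum(R[r:, :r] ** 2, axis=1)
print(np.isclose(np.trace(residual), np.sum(remaining)))  # True

# the selected diagonal values R[i,i]^2 decrease monotonically
pivots = np.diag(R)[:r] ** 2
print(np.all(np.diff(pivots) <= 1e-10))                   # True
```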

At least one support vector exists.

The same discussion as in Chapter 8 (Support Vector Machine) of “Mathematics of Statistical Machine Learning” also appears in “The Elements of Statistical Learning” and in Bishop’s book “Pattern Recognition and Machine Learning”. However, the reason why there exists at least one $i$ such that $y_i(\beta_0 + x_i\beta) = 1$ is not explained there. I mentioned it in my book, but upon careful consideration it turns out I was mistaken. Intuitively, the correct explanation is that “if there were no support vectors, the margin could be made larger”, but I had been wondering whether there is a shorter mathematical argument.
First, if $y_1, \ldots, y_N$ are all equal, the optimal solution is $\beta = 0$, and from $y_i(\beta_0 + x_i\beta) = 1$ we have $\beta_0 = y_i$, so the claim holds for every $i$. In the other cases, we derive a contradiction by assuming that the union of the sets $A = \{i \mid y_i(\beta_0 + x_i\beta) > 1 \}$ and $B = \{i \mid y_i(\beta_0 + x_i\beta) < 1 \}$ equals $\{1, \ldots, N\}$. Equations (8.18) and (8.19) are, respectively,
\[ \beta = C\sum_{i\in B}y_ix_i^\top \]
\[ \sum_{i\in B}y_i = 0 \]
Taking the sum over $i = 1, \ldots, N$ in (8.16), since $\alpha_i = 0$ for $i \in A$, we have
\[ C\sum_{i\in B}\{ y_i(\beta_0+x_i\beta)-(1-\epsilon_i)\} = 0 \] and substituting the above two equations, we get
\[ \beta^\top \beta = C\sum_{i\in B}(1-\epsilon_i) \] Substituting these into (8.3), we get
\[ L_P = C\{-\frac{1}{2}\sum_{i\in B}(1-\epsilon_i)+\sum_{i\in A}\epsilon_i+N\} \] For $i \in B$, from (8.14), $\epsilon_i > 0$. This means that, without changing the membership of the sets $A$ and $B$, decreasing any $\epsilon_i$ with $i \in B$ would decrease $L_P$ further, contradicting optimality. Also, if $B$ is empty and only $A$ remains, then $\beta = 0$ by (8.18), so $y_i\beta_0 > 1$ holds for all $i \in A$, which implies $y_1 = \cdots = y_N$ (a case already excluded here). Therefore, $A \cup B \neq \{1, \ldots, N\}$.

Therefore, in either case, there exists at least one $i$ such that $y_i(\beta_0+x_i\beta) = 1$.
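As a quick numerical illustration (a sketch assuming scikit-learn; the toy data and variable names are ours), fitting a linear soft-margin SVM and computing $y_i(\beta_0 + x_i\beta)$ shows that some of these values equal 1 up to the solver's tolerance; these are the support vectors with $0 < \alpha_i < C$:

```python
import numpy as np
from sklearn.svm import SVC

# toy two-class data (both labels present, not linearly separable)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0, tol=1e-5).fit(X, y)
beta = clf.coef_.ravel()              # beta in the notation above
beta0 = clf.intercept_[0]             # beta_0

margins = y * (X @ beta + beta0)      # y_i (beta_0 + x_i beta)
# at least one margin equals 1, up to the solver's tolerance
print(np.min(np.abs(margins - 1)) < 1e-3)   # True
```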

Table of Contents: “Kernel Methods for Machine Learning with Math & R/Python” in the book series (Dec. 2021)

The new books will be available from Dec. 2021 to Jan. 2022 and contain many source programs as well as mathematical propositions.

Chapter 1 Positive Definite Kernel
1.1 Positive Definiteness of Matrices
1.2 Kernel
1.3 Positive Definite Kernel
1.4 Probability
1.5 Bochner's theorem
1.6 String and Tree Kernels

Chapter 2 Hilbert Space
2.1 Metric space and completeness
2.2 Linear space and inner product space
2.3 Hilbert space
2.4 Projection theorem
2.5 Linear operator
2.6 Compact operator

Chapter 3 Reproducing Kernel Hilbert Space
3.1 RKHS
3.2 Sobolev space
3.3 Mercer's theorem

Chapter 4 Kernel Calculation
4.1 Kernel ridge regression
4.2 Kernel Principal Component Analysis
4.3 Kernel SVM
4.4 Splines
4.5 Random Fourier Features
4.6 Nyström Approximation
4.7 Incomplete Cholesky Decomposition

Chapter 5 MMD and HSIC
5.1 Random Variables in RKHS
5.2 MMD and the two sample problem
5.3 Testing Independence via HSIC
5.4 Characteristic Kernel and Universal Kernel
5.5 Introduction to Empirical Processes

Chapter 6 Gaussian Processes and Functional Data Analysis
6.1 Regression
6.2 Classification
6.3 Auxiliary variable method
6.4 Karhunen-Loève expansion
6.5 Functional data analysis