In practical machine learning problems, the data is often very high dimensional: bag-of-words representations of documents, images, and so on. In such cases we cannot meaningfully apply an off-the-shelf machine learning algorithm, partly because of the memory requirements. The problem is aggravated if the high dimensional data has missing entries. Consider the classic case of Netflix. The Netflix dataset is a matrix containing each user's rating for each movie. Both the number of movies and the number of users are large, and not every movie is rated by every user, so we have missing entries. Now suppose we wish to cluster these users using the k-nearest neighbour algorithm. How do we do it? As mentioned above, even if we somehow impute the missing data, the k-nearest neighbour algorithm is likely to fail due to the "curse of dimensionality". What we wish to do instead is reduce the dimensionality of the data while preserving its interesting properties. More often than not this compression is lossy, so we want to reduce the dimension such that the loss is minimal, while at the same time accounting for the missing values.

While the data may be very high dimensional, it will usually lie close to a much lower dimensional **manifold**. Think of a manifold as a warped surface in some high dimensional space. To justify this, consider the MNIST dataset, which consists of 28x28 = 784-dimensional images of the digits 0 through 9. Even if these images were binary, that is, each pixel intensity could take only the values 0 or 1, there would be $2^{784}$ possible images. Nevertheless, an intelligent algorithm need not be shown this gigantic number of images to be able to recognise which ones are digits. Digit-like images must therefore occupy a highly constrained volume within the 784 dimensions, and we expect only a small number of degrees of freedom to be required to describe this data reasonably well.

Suppose we have $N$ data vectors $\mathbf{x}^n : n \in [1, 2, \ldots, N]$, each of dimension $D$; the $i^{th}$ element of $\mathbf{x}^n$ is the scalar $x^n_i$. Let the corresponding design matrix be $X$, of dimension $D \times N$. We would like to approximate each $\mathbf{x}^n$ by a corresponding $\mathbf{y}^n$ of dimension $M$. Thus,
$$\mathbf{x}^n \approx \mathbf{B}\mathbf{y}^n + \mathbf{c}$$

where $B$ is the matrix whose columns are the $M$ basis vectors, so its dimension is $D \times M$. The constant vector $\mathbf{c}$, of dimension $D$, accounts for the offset.

As mentioned earlier, $B$ is a transformation matrix containing basis vectors: it transforms between the $M$-dimensional reduced space and the $D$-dimensional data space. However, the choice of $B$ is obviously not unique, because any $M$-dimensional space has infinitely many sets of basis vectors that can represent a given point. Consider, for example, a 2D space. Any point can be represented by Cartesian coordinates; if the coordinate frame is rotated by 45 degrees, we can still describe every point. Similarly, the scale of the basis vectors does not matter. We can therefore impose further restrictions on our choice of $B$ to make the calculations easier. Without loss of generality one can assume $B$ to be an orthonormal matrix. That is,
$$\mathbf{B}^T\mathbf{B} = \mathbf{I}_M$$

For the more mathematically inclined: consider an invertible transformation $Q$ such that $BQ$ is an orthonormal matrix. Then $BY = BQQ^{-1}Y = \hat{B}\hat{Y}$, where we have defined $\hat{B} = BQ$ and $\hat{Y} = Q^{-1}Y$. Thus, we can always choose to find such an orthonormal $\hat{B}$ without loss of generality.

One approach would be to minimise the sum of squared errors between each $\mathbf{x}^n$ and its reconstruction. Mathematically,
$$E(\mathbf{B}, \mathbf{Y}, \mathbf{c}) = \sum_{n=1}^N \left\| \mathbf{x}^n - \mathbf{B}\mathbf{y}^n - \mathbf{c} \right\|^2$$

The above loss function can be written in vector form as follows.
$$E = \sum_{n=1}^N \left[ (\mathbf{x}^n - \mathbf{c})^T(\mathbf{x}^n - \mathbf{c}) - 2\,(\mathbf{x}^n - \mathbf{c})^T\mathbf{B}\mathbf{y}^n + (\mathbf{y}^n)^T\mathbf{y}^n \right]$$

where we have used the fact that $\mathbf{B}^T\mathbf{B} = \mathbf{I}$. To find the value of $Y$ at which the above loss function achieves its minimum, we partially differentiate it and equate to zero as follows.
$$\frac{\partial E}{\partial \mathbf{y}^n} = -2\,\mathbf{B}^T(\mathbf{x}^n - \mathbf{c}) + 2\,\mathbf{y}^n = 0 \quad\Rightarrow\quad \mathbf{y}^n = \mathbf{B}^T(\mathbf{x}^n - \mathbf{c})$$

Replacing $Y$ with the value found above, the loss function becomes
$$E = \sum_{n=1}^N (\mathbf{x}^n - \mathbf{c})^T(\mathbf{I} - \mathbf{B}\mathbf{B}^T)(\mathbf{x}^n - \mathbf{c})$$

where in the second-to-last step we use the fact that $(\mathbf{I} - \mathbf{BB}^T)$ is a symmetric idempotent matrix, i.e. a matrix $Q$ such that $Q^TQ = Q$.
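As a quick numerical sanity check (a minimal numpy sketch of my own, with made-up variable names), we can verify that $\mathbf{I} - \mathbf{B}\mathbf{B}^T$ has exactly this property when $B$ is orthonormal:

```python
import numpy as np

# Minimal sketch: build an orthonormal B (via QR of a random matrix)
# and check that P = I - B B^T satisfies P^T P = P.
rng = np.random.default_rng(0)
D, M = 5, 2
B, _ = np.linalg.qr(rng.standard_normal((D, M)))  # columns satisfy B^T B = I_M
P = np.eye(D) - B @ B.T
assert np.allclose(B.T @ B, np.eye(M))  # orthonormal columns
assert np.allclose(P @ P, P)            # idempotent: P^2 = P
assert np.allclose(P.T, P)              # symmetric, hence P^T P = P
```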

Thus, differentiating with respect to $\mathbf{c}$ and equating to zero, we get
$$\frac{\partial E}{\partial \mathbf{c}} = -2\,(\mathbf{I} - \mathbf{B}\mathbf{B}^T)\sum_{n=1}^N(\mathbf{x}^n - \mathbf{c}) = 0 \quad\Rightarrow\quad \mathbf{c} = \frac{1}{N}\sum_{n=1}^N \mathbf{x}^n$$

Thus we see that $\mathbf{c}$ is just the mean of our data.

From the above result, we observe that we can eliminate the unknown $\mathbf{c}$ from our equations if we center the data, that is, subtract the mean from all the data points. Notice that with centering, the reduced representation becomes $\mathbf{y}^n = \mathbf{B}^T\mathbf{x}^n$. In the remainder of this post, I assume that the data is already centered.
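The centering and projection steps can be sketched in a few lines of numpy (variable names here are my own, not from the post's implementation):

```python
import numpy as np

# Sketch: center the D x N design matrix, then project onto an
# orthonormal basis B to get the reduced representation Y = B^T X.
rng = np.random.default_rng(0)
D, N, M = 6, 100, 2
X = rng.standard_normal((D, N))
c = X.mean(axis=1, keepdims=True)                 # offset c is the data mean
Xc = X - c                                        # centered data
B, _ = np.linalg.qr(rng.standard_normal((D, M)))  # orthonormal D x M basis
Y = B.T @ Xc                                      # reduced M x N representation
X_hat = B @ Y + c                                 # reconstruction
assert np.allclose(Xc.mean(axis=1), 0)            # centered data has zero mean
```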

So far in our analysis, we have not considered that $\mathbf{X}$ may contain missing values. Also, in the above analysis I have conveniently skipped the calculation for finding $\mathbf{B}$. This is because, for dealing with missing data, we have to take a slightly different approach. Suppose the data sample $\mathbf{x}^n$ has a missing value at index $i$, that is, $x^n_i$ is missing. To incorporate this, we associate a mask $\boldsymbol{\gamma}^n$ with each $\mathbf{x}^n$ such that
$$\gamma^n_i = \begin{cases} 1 & \text{if } x^n_i \text{ is present} \\ 0 & \text{if } x^n_i \text{ is missing} \end{cases}$$

With the above, we can modify our loss function to penalise the reconstruction only at values that are present and to ignore it at missing values. The modified loss function is given below.
$$E(\mathbf{B}, \mathbf{Y}) = \sum_{n=1}^N \sum_{i=1}^D \gamma^n_i \Big( x^n_i - \sum_{j=1}^M B_{i,j}\, y^n_j \Big)^2$$

where we have dropped the offset vector $\mathbf{c}$ because we assume that the data is centered.
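The masked loss is straightforward to compute with a 0/1 mask array. Below is a small sketch under my own naming conventions (`gamma[i, n]` is 1 where $x^n_i$ is observed):

```python
import numpy as np

# Sketch of the masked squared-error loss: residuals are penalised
# only at observed entries of X.
def masked_loss(X, B, Y, gamma):
    R = X - B @ Y                  # D x N reconstruction residuals
    return np.sum(gamma * R**2)    # zero out contributions of missing entries

rng = np.random.default_rng(0)
D, N, M = 6, 50, 2
X = rng.standard_normal((D, N))
gamma = (rng.random((D, N)) > 0.3).astype(float)  # ~30% entries missing
B = rng.standard_normal((D, M))
Y = rng.standard_normal((M, N))
loss = masked_loss(X, B, Y, gamma)
assert loss >= 0
```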

Unfortunately, with the introduction of missing values, we cannot get a closed form solution straight away. However, there is a clever trick we can employ to find (close to) optimal values of both $\mathbf{B}$ and $\mathbf{Y}$. Notice that if we assume $\mathbf{Y}$ is given, the loss function is a quadratic function of $\mathbf{B}$. Similarly, if we assume $\mathbf{B}$ is given, the loss function is a quadratic function of $\mathbf{Y}$. Thus, if we solve for $\mathbf{y}^n$ with $\mathbf{B}$ fixed, we should expect our loss to decrease, and the same holds the other way round. We can therefore employ a strategy where we alternately optimize $\mathbf{Y}$ and $\mathbf{B}$ and expect the final values to converge to a solution.

Differentiating the loss function with respect to $y^k_l$ and equating to zero, we see
$$\sum_{j=1}^M M^k_{l,j}\, y^k_j = c^k_l \quad \text{for } l = 1, \ldots, M, \qquad \text{i.e.} \quad \mathbf{M}^k \mathbf{y}^k = \mathbf{c}^k$$

where we have defined $c_l^k = \sum_{i=1}^D \gamma^k_i x_i^k B_{i, l}$ and $M_{l,j}^k = \sum_{i=1}^D \gamma^k_i B_{i,j} B_{i, l}$. Note that $\mathbf{c}^k$ is a vector of length $M$ and the matrix $\mathbf{M}^k$ is $M \times M$. Thus, we can iterate through every point $k \in 1, 2, \ldots, N$ and solve the above system of linear equations. It could be that the matrix $\mathbf{M}^k$ is non-invertible, in which case we can use the pseudo-inverse. This is guaranteed to decrease the loss for a given basis matrix $\mathbf{B}$.

###### Computation complexity of finding $\mathbf{Y}$

For each $k$, calculating the $M$ elements of $\mathbf{c}^k$ requires $D$ summations each, which is $O(MD)$. Computing the $M \times M$ elements of $\mathbf{M}^k$ requires $D$ summations each, which is $O(DM^2)$. Matrix inversion of an $M \times M$ matrix is $O(M^3)$. Because this needs to be done for each sample point, the overall time complexity is $O(NM^3 + NDM^2)$. Since $D$ is always greater than $M$ in dimensionality reduction, the total computational complexity is $O(NDM^2)$.
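In matrix form, $\mathbf{M}^k = \mathbf{B}^T \mathrm{diag}(\boldsymbol{\gamma}^k)\, \mathbf{B}$ and $\mathbf{c}^k = \mathbf{B}^T \mathrm{diag}(\boldsymbol{\gamma}^k)\, \mathbf{x}^k$, which suggests a compact numpy sketch of the $\mathbf{Y}$-update for a single sample (function and variable names are my own, and `lstsq` stands in for the pseudo-inverse):

```python
import numpy as np

# Sketch of the Y-update for one sample k: solve M^k y^k = c^k, where
# M^k = B^T diag(gamma^k) B and c^k = B^T diag(gamma^k) x^k.
def solve_y(x_k, gamma_k, B):
    Mk = B.T @ (gamma_k[:, None] * B)   # M x M system matrix
    ck = B.T @ (gamma_k * x_k)          # length-M right-hand side
    # lstsq acts as a pseudo-inverse solve in case Mk is singular
    return np.linalg.lstsq(Mk, ck, rcond=None)[0]

rng = np.random.default_rng(0)
D, M = 8, 3
B, _ = np.linalg.qr(rng.standard_normal((D, M)))  # orthonormal basis
x = rng.standard_normal(D)
gamma = (rng.random(D) > 0.3).astype(float)
y = solve_y(x, gamma, B)
# With nothing missing and orthonormal B, this reduces to y = B^T x.
assert np.allclose(solve_y(x, np.ones(D), B), B.T @ x)
```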

Similar to the procedure for finding $\mathbf{y}^n$, we differentiate with respect to $B_{k,l}$ this time, and equate to zero.
$$\sum_{j=1}^M F^k_{l,j}\, B_{k,j} = m^k_l \quad \text{for } l = 1, \ldots, M, \qquad \text{i.e.} \quad \mathbf{F}^k \mathbf{b}^k = \mathbf{m}^k$$

where we defined $F^k_{l,j} = \sum_{n=1}^N\gamma^n_ky^n_jy^n_l$ and $m_l^k = \sum_{n=1}^N \gamma^n_k x_k^n y^n_l$. Notice that $\mathbf{b}^k$ is the $k^{th}$ row of $\mathbf{B}$. Also notice that the matrix $\mathbf{F}^k$ is $M \times M$ and $\mathbf{m}^k$ is a vector with $M$ elements. This way we can iterate over $k$ to solve $D$ linear systems of equations and find the optimal $\mathbf{B}$.

###### Computation complexity of finding $\mathbf{B}$

For each row $k$, computation of $\mathbf{m}^k$ requires $N$ summations per element, and there are $M$ elements in total, so it is $O(NM)$. Similarly, computing $\mathbf{F}^k$ requires $O(NM^2)$. Accounting for the matrix inversion and the fact that this needs to be done for each row of $\mathbf{B}$, we get a total time complexity of $O(DM^3 + DNM^2)$. Typically $M$ is less than $N$, so the final complexity is $O(DNM^2)$.
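The $\mathbf{B}$-update for one row has the same shape as the $\mathbf{Y}$-update: $\mathbf{F}^k = \sum_n \gamma^n_k\, \mathbf{y}^n (\mathbf{y}^n)^T$ and $\mathbf{m}^k = \sum_n \gamma^n_k\, x^n_k\, \mathbf{y}^n$. A sketch under my own naming conventions, again using `lstsq` in place of an explicit pseudo-inverse:

```python
import numpy as np

# Sketch of the B-update for row k: solve F^k b^k = m^k, where x_row
# and gamma_row hold the N values and mask entries of feature k.
def solve_b_row(x_row, gamma_row, Y):
    Fk = (gamma_row[None, :] * Y) @ Y.T   # M x M: sum_n gamma_n y^n y^n^T
    mk = Y @ (gamma_row * x_row)          # length M: sum_n gamma_n x_n y^n
    return np.linalg.lstsq(Fk, mk, rcond=None)[0]

rng = np.random.default_rng(0)
D, N, M = 6, 40, 2
Y = rng.standard_normal((M, N))           # reduced representations
x_row = rng.standard_normal(N)            # feature k across all samples
gamma_row = (rng.random(N) > 0.3).astype(float)
b = solve_b_row(x_row, gamma_row, Y)
assert b.shape == (M,)
```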

The complexity of $O(DNM^2)$ is intractable for large $N$ and $D$, which is often the case in practical datasets. How do we apply these ideas to large datasets? The idea is to break the dataset down and work in small batches. Let $\mathbf{B}^s$ denote the basis matrix calculated as above for the $s^{th}$ batch. Then we can apply the following update rule.

An implementation of the above can be found here. It is written in Python. I wrote it for one of the Kaggle competitions given as an assignment in the Applied Machine Learning course taught by Dr. David Barber. Most of the theory above is a direct result of his amazing course lectures.

Below I use the MNIST dataset (a subset of 10000 images) and randomly make a certain percentage of its pixels missing. The original dimension is 28x28 = 784. The reduced dimension $M$ (chosen arbitrarily, without any validation) is 60. Following are the results for different percentages of missing data on 10 randomly selected samples. In each case, the top row shows the original samples, the second row the data with missing pixels (red being a missing pixel), and the third row the reconstruction.

- 10% missing data

- 30% missing data

- 50% missing data

- 70% missing data

It is amazing to see that even in the last case, where it is difficult for humans to recognise the actual digit, the algorithm seems to be doing reasonably well.