Nonlinear dimensionality reduction: Difference between revisions

Revision as of 15:37, 30 June 2010

High-dimensional data, meaning data that requires more than two or three dimensions to represent, can be difficult to interpret. One approach to simplification is to assume that the data of interest lies on an embedded non-linear manifold within the higher-dimensional space. If the manifold is of low enough dimension then the data can be visualised in the low dimensional space.

Below is a summary of some of the important algorithms from the history of manifold learning and nonlinear dimensionality reduction. Many of these non-linear dimensionality reduction methods are related to the linear methods listed below. Non-linear methods can be broadly classified into two groups: those that provide a mapping (either from the high-dimensional space to the low-dimensional embedding or vice versa), and those that just give a visualisation. In the context of machine learning, mapping methods may be viewed as a preliminary feature extraction step, after which pattern recognition algorithms are applied. Methods that just give a visualisation are typically based on proximity data, that is, distance measurements.

Linear methods

Independent component analysis (ICA).
Principal component analysis (PCA), also called the Karhunen–Loève transform (KLT).
Singular value decomposition (SVD).
Factor analysis.

Uses for NLDR

Consider a dataset represented as a matrix (or a database table), such that each row represents a set of attributes (or features or dimensions) that describe a particular instance of something. If the number of attributes is large, then the space of unique possible rows is exponentially large. Thus, the larger the dimensionality, the more difficult it becomes to sample the space. This causes many problems. Algorithms that operate on high-dimensional data tend to have a very high time complexity. Many machine learning algorithms, for example, struggle with high-dimensional data. This has become known as the curse of dimensionality. Reducing data into fewer dimensions often makes analysis algorithms more efficient, and can help machine learning algorithms make more accurate predictions.
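As a rough illustration of how quickly the space of possible rows grows, the sketch below counts the distinct rows when every attribute takes one of a fixed number of values, and the largest fraction of them that a sample of a given size could cover. The function names and the choice of binary attributes are assumptions made for the example only.

```python
def count_possible_rows(values_per_attribute: int, n_attributes: int) -> int:
    """Number of distinct rows when each attribute takes one of a fixed set of values."""
    return values_per_attribute ** n_attributes


def max_coverage(n_samples: int, values_per_attribute: int, n_attributes: int) -> float:
    """Upper bound on the fraction of distinct rows that n_samples rows can cover."""
    return min(1.0, n_samples / count_possible_rows(values_per_attribute, n_attributes))


# One million samples of binary attributes: the coverage collapses as attributes are added.
for n_attributes in (10, 20, 40):
    print(n_attributes, count_possible_rows(2, n_attributes), max_coverage(1_000_000, 2, n_attributes))
```

With ten binary attributes a million samples can cover every possible row; with forty, they can cover less than one millionth of them.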

Humans often have difficulty comprehending data in many dimensions. Thus, reducing data to a small number of dimensions is useful for visualization purposes.

The reduced-dimensional representations of data are often referred to as "intrinsic variables". This description implies that these are the values from which the data was produced. For example, consider a dataset that contains images of a letter 'A', which has been scaled and rotated by varying amounts. Each image has 32x32 pixels. Each image can be represented as a vector of 1024 pixel values. Each row is a sample on a two-dimensional manifold in 1024-dimensional space (a Hamming space). The intrinsic dimensionality is two, because two variables (rotation and scale) were varied in order to produce the data. Information about the shape or look of a letter 'A' is not part of the intrinsic variables because it is the same in every instance. Nonlinear dimensionality reduction will discard the correlated information (the letter 'A') and recover only the varying information (rotation and scale). The image below shows sample images from this dataset (to save space, not all input images are shown), and a plot of the two-dimensional points that results from using a NLDR algorithm (in this case, Manifold Sculpting was used) to reduce the data into just two dimensions.

File:Nldr.jpg

By comparison, if PCA (a linear dimensionality reduction algorithm) is used to reduce this same dataset into two dimensions, the resulting values are not so well organized. This demonstrates that the high-dimensional vectors (each representing a letter 'A') that sample this manifold vary in a non-linear manner.

File:Letters pca.jpg

It should be apparent, therefore, that NLDR has several applications in the field of computer vision. For example, consider a robot that uses a camera to navigate in a closed static environment. The images obtained by that camera can be considered to be samples on a manifold in high-dimensional space, and the intrinsic variables of that manifold will represent the robot's position and orientation. This utility is not limited to robots. Dynamical systems, a more general class of systems that includes robots, are defined in terms of a manifold. Active research in NLDR seeks to unfold the observation manifolds associated with dynamical systems in order to develop techniques for modeling such systems and enable them to operate autonomously.

Manifold learning algorithms

Some of the more prominent manifold learning algorithms are listed below. An algorithm may learn an internal model of the data, which can be used to map points unavailable at training time into the embedding in a process often called out-of-sample extension.

Principal curves and manifolds

Principal curves and manifolds give the natural geometric framework for nonlinear dimensionality reduction and extend the geometric interpretation of PCA by explicitly constructing an embedded manifold and by encoding using standard geometric projection onto the manifold. How to define the "simplicity" of the manifold is problem-dependent; however, it is commonly measured by the intrinsic dimensionality and/or the smoothness of the manifold.^[1]

Kernel Principal Component Analysis

Perhaps the most widely used algorithm for manifold learning is kernel PCA.^[2] It is a combination of principal component analysis and the kernel trick. PCA begins by computing the covariance matrix of the $m\times n$ matrix $\mathbf {X}$, whose $m$ rows are the samples $\mathbf {x} _{i}$:

$C={\frac {1}{m}}\sum _{i=1}^{m}{\mathbf {x} _{i}\mathbf {x} _{i}^{\mathsf {T}}}.$

It then projects the data onto the first k eigenvectors of that matrix. By comparison, KPCA begins by computing the covariance matrix of the data after being transformed into a higher-dimensional space,

$C={\frac {1}{m}}\sum _{i=1}^{m}{\Phi (\mathbf {x} _{i})\Phi (\mathbf {x} _{i})^{\mathsf {T}}}.$

It then projects the transformed data onto the first k eigenvectors of that matrix, just like PCA. It uses the kernel trick to factor away much of the computation, such that the entire process can be performed without actually computing $\Phi (\mathbf {x} )$. Of course, $\Phi$ must be chosen such that it has a known corresponding kernel. Unfortunately, it is not trivial to find a good kernel for a given problem, so KPCA does not yield good results with some problems. For example, it is known to perform poorly with the Swiss roll manifold.

KPCA has an internal model, so it can be used to map points onto its embedding that were not available at training time.
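The two procedures described above can be compared in a minimal NumPy sketch. The function names, the RBF kernel, and its `gamma` parameter are assumptions chosen for illustration, not part of the original description. The data is centered before the covariance matrix is formed, and the kernel matrix is centered in feature space, so that $\Phi$ is never computed explicitly.

```python
import numpy as np


def pca(X, k):
    """Project the rows of X onto the first k eigenvectors of the covariance matrix."""
    Xc = X - X.mean(axis=0)                     # center so that C is a covariance matrix
    C = Xc.T @ Xc / Xc.shape[0]                 # C = (1/m) sum_i x_i x_i^T
    _, eigvecs = np.linalg.eigh(C)              # eigenvalues come back in ascending order
    return Xc @ eigvecs[:, ::-1][:, :k]


def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def kernel_pca(X, k, kernel):
    """Embed the training points using only kernel evaluations (the kernel trick)."""
    m = X.shape[0]
    K = kernel(X, X)
    one = np.full((m, m), 1.0 / m)
    Kc = K - one @ K - K @ one + one @ K @ one  # center the data in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)
    eigvals, eigvecs = eigvals[::-1][:k], eigvecs[:, ::-1][:, :k]
    alphas = eigvecs / np.sqrt(eigvals)         # unit-norm eigenvectors in feature space
    return Kc @ alphas                          # projections of the training points


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
Y = kernel_pca(X, 2, lambda A, B: rbf_kernel(A, B, gamma=0.1))
```

With the linear kernel `lambda A, B: A @ B.T`, this sketch reproduces the PCA projection up to the sign of each column, which is a convenient sanity check.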

Gaussian process latent variable models

Gaussian process latent variable models (GPLVM)^[3] are a probabilistic non-linear PCA. Like kernel PCA, they use a kernel function to form the mapping (in the form of a Gaussian process). However, in the GPLVM the mapping is from the embedded space to the data space (like density networks and GTM), whereas in kernel PCA it is in the opposite direction.

Kohonen Maps

Kohonen maps (also called self-organizing maps or SOM) and their probabilistic variant, generative topographic mapping (GTM), use a point representation in the embedded space to form a latent variable model based on a non-linear mapping from the embedded space to the high-dimensional space. These techniques are related to work on density networks, which are also based around the same probabilistic model.

Curvilinear Distance Analysis

CDA^[4] trains a self-organizing neural network to fit the manifold and seeks to preserve geodesic distances in its embedding. It is based on Curvilinear Component Analysis (which extended Sammon's mapping), but uses geodesic distances instead.

Isomap

Isomap^[5] is a combination of the Floyd-Warshall algorithm with classic Multidimensional Scaling. Classic Multidimensional Scaling (MDS) takes a matrix of pair-wise distances between all points, and computes a position for each point. With NLDR algorithms like Isomap, however, the pair-wise distances are only known between neighboring points. So Isomap uses the Floyd-Warshall algorithm to compute the pair-wise distances between all of the other points. This effectively estimates the full matrix of pair-wise geodesic distances between all of the points. Isomap then uses classic MDS to compute the reduced-dimensional positions of all the points.

Curvilinear Distance Analysis (CDA) is similar to Isomap in that it also estimates geodesic distances in the same manner, and seeks to preserve geodesic distances while projecting the data into fewer dimensions. Isomap is often preferred because it computes the final embedding using MDS, which uses an eigenvector-based optimization technique instead of the self-organizing-network training technique used by CDA. Although Isomap tends to produce somewhat poorer results than CDA^[6], it is significantly faster and is simple to implement. Its weakness is primarily due to inaccuracies in its estimate of geodesic distance. It tends to produce especially poor results near unsampled regions of the manifold. Isomap has no internal model, so only points available at training time are mapped onto the embedding.

Landmark-Isomap is a variant of this algorithm that uses landmarks to increase speed, at the cost of some accuracy.
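The three stages described above (neighborhood graph, Floyd-Warshall shortest paths, classic MDS) can be sketched with NumPy as follows. This is an illustrative, dense, cubic-time version; the function names and the symmetrised k-nearest-neighbor graph are assumptions of the sketch, and it does not implement the Landmark-Isomap variant.

```python
import numpy as np


def knn_graph(X, k):
    """Distance matrix that keeps only edges to the k nearest neighbors (inf elsewhere)."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]        # position 0 is the point itself
        G[i, nbrs] = D[i, nbrs]
        G[nbrs, i] = D[i, nbrs]                 # keep the graph symmetric
    return G


def floyd_warshall(G):
    """All-pairs shortest path lengths, used as estimates of geodesic distance."""
    D = G.copy()
    for via in range(D.shape[0]):
        D = np.minimum(D, D[:, [via]] + D[[via], :])
    return D


def classical_mds(D, d):
    """Classic MDS: positions whose Euclidean distances approximate D."""
    n = D.shape[0]
    J = np.eye(n) - np.full((n, n), 1.0 / n)    # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                 # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))


def isomap(X, k, d):
    geodesic = floyd_warshall(knn_graph(X, k))
    if np.isinf(geodesic).any():
        raise ValueError("neighborhood graph is disconnected; try a larger k")
    return classical_mds(geodesic, d)


# Points along a half circle in 2-D are unrolled into one dimension.
t = np.linspace(0.0, np.pi, 40)
arc = np.column_stack([np.cos(t), np.sin(t)])
embedding = isomap(arc, k=2, d=1)
```

The disconnected-graph check reflects the weakness noted above: where the manifold is poorly sampled, shortest-path estimates of geodesic distance become unreliable or undefined.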

Locally-Linear Embedding

Locally-Linear Embedding (LLE)^[7] was presented at approximately the same time as Isomap. It has several advantages over Isomap, including faster optimization when implemented to take advantage of sparse matrix algorithms, and better results with many problems. LLE also begins by finding a set of the nearest neighbors of each point. It then computes a set of weights for each point that best describe the point as a linear combination of its neighbors. Finally, it uses an eigenvector-based optimization technique to find the low-dimensional embedding of points, such that each point is still described with the same linear combination of its neighbors. LLE tends to handle non-uniform sample densities poorly because there is no fixed unit to prevent the weights from drifting as various regions differ in sample densities. LLE has no internal model.

LLE computes the barycentric coordinates of a point X_i based on its neighbors X_j. The original point is reconstructed by a linear combination of its neighbors, with coefficients given by the weight matrix W_ij. The reconstruction error is given by the cost function E(W).

 $E(W)=\sum _{i}\left|\mathbf {X} _{i}-\sum _{j}\mathbf {W} _{ij}\mathbf {X} _{j}\right|^{2}$

The weights W_ij refer to the amount of contribution the point X_j has while reconstructing the point X_i. The cost function is minimized under two constraints: (a) each data point X_i is reconstructed only from its neighbors, thus enforcing W_ij to be zero if point X_j is not a neighbor of the point X_i, and (b) the sum of every row of the weight matrix equals 1.

 $\sum _{j}{\mathbf {W} _{ij}}=1$

The original data points are collected in a D-dimensional space and the goal of the algorithm is to reduce the dimensionality to d such that D >> d. The same weights W_ij that reconstruct the i-th data point in the D-dimensional space are used to reconstruct the same point in the lower d-dimensional space. A neighborhood-preserving map is created based on this idea. Each point X_i in the D-dimensional space is mapped onto a point Y_i in the d-dimensional space by minimizing the cost function

 $C(Y)=\sum _{i}\left|\mathbf {Y} _{i}-\sum _{j}\mathbf {W} _{ij}\mathbf {Y} _{j}\right|^{2}$

In this cost function, unlike the previous one, the weights W_ij are kept fixed and the minimization is done over the points Y_i to optimize the coordinates. This minimization problem can be solved as a sparse N × N eigenvalue problem (N being the number of data points), whose bottom d nonzero eigenvectors provide an orthogonal set of coordinates. Generally the data points are reconstructed from K nearest neighbors, as measured by Euclidean distance. For such an implementation the algorithm has only one free parameter K, which can be chosen by cross validation.
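The two minimizations can be sketched in NumPy as below. This is a dense illustration rather than the sparse implementation mentioned above; the function name `lle` and the regularisation term `reg` (added so the local linear system stays solvable when K exceeds D) are assumptions of the sketch.

```python
import numpy as np


def lle(X, k, d, reg=1e-3):
    """Return the d-dimensional embedding Y and the weight matrix W."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist[i])[1:k + 1]     # constraint (a): only neighbors get weight
        Z = X[nbrs] - X[i]                      # neighbors expressed relative to X_i
        G = Z @ Z.T                             # local Gram matrix (k x k)
        G += reg * np.trace(G) * np.eye(k)      # assumed regularisation, not in the text
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()                # constraint (b): each row sums to 1
    # With W fixed, the embedding cost is Y^T M Y with M = (I - W)^T (I - W).
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    _, eigvecs = np.linalg.eigh(M)              # ascending eigenvalues
    Y = eigvecs[:, 1:d + 1]                     # skip the constant bottom eigenvector
    return Y, W


rng = np.random.default_rng(0)
t = rng.uniform(0.0, 3.0, size=80)
helix = np.column_stack([np.cos(t), np.sin(t), 0.5 * t])
Y, W = lle(helix, k=6, d=1)
```

Skipping the first eigenvector corresponds to discarding the zero eigenvalue: because every row of W sums to 1, the constant vector always has zero cost and carries no information.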

Laplacian Eigenmaps

Laplacian Eigenmaps^[8] uses spectral techniques to perform dimensionality reduction. This technique relies on the basic assumption that the data lies on a low-dimensional manifold in a high-dimensional space.^[9] This algorithm cannot embed out-of-sample points, but techniques based on reproducing kernel Hilbert space regularization exist for adding this capability.^[10] Such techniques can be applied to other nonlinear dimensionality reduction algorithms as well.

Traditional techniques like Principal Component Analysis do not consider the intrinsic geometry of the data. Laplacian Eigenmaps builds a graph from neighborhood information of the data set. Each data point serves as a node on the graph and connectivity between nodes is governed by the proximity of neighboring points (using e.g. the k-nearest neighbor algorithm). The graph thus generated can be considered as a discrete approximation of the low dimensional manifold in the high dimensional space. Minimization of a cost function based on the graph ensures that points close to each other on the manifold are mapped close to each other in the low dimensional space, preserving local distances. The eigenfunctions of the Laplace-Beltrami operator on the manifold serve as the embedding dimensions, since under mild conditions this operator has a countable spectrum that is a basis for square integrable functions on the manifold (compare to Fourier series on the unit circle manifold). Attempts to place Laplacian eigenmaps on solid theoretical ground have met with some success, as under certain nonrestrictive assumptions, the graph Laplacian matrix has been shown to converge to the Laplace-Beltrami operator as the number of points goes to infinity ^[11]. Matlab code for Laplacian Eigenmaps can be found in http://www.cse.ohio-state.edu/~mbelkin/algorithms/algorithms.html and the PhD thesis of Belkin can be found at http://www.cse.ohio-state.edu/~mbelkin/papers/papers.html#thesis.
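A minimal NumPy sketch of the graph construction and spectral step follows. It assumes a symmetrised k-nearest-neighbor graph with 0/1 edge weights and solves the generalized eigenproblem L y = λ D y through the symmetrically normalized Laplacian. The function name and these specific choices are assumptions for illustration and are not taken from the Matlab code linked above.

```python
import numpy as np


def laplacian_eigenmaps(X, k, d):
    """Return the embedding Y and the combinatorial graph Laplacian L."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    A = np.zeros((n, n))
    for i in range(n):
        A[i, np.argsort(dist[i])[1:k + 1]] = 1.0    # connect each point to its k neighbors
    A = np.maximum(A, A.T)                          # make the adjacency symmetric
    deg = A.sum(axis=1)
    L = np.diag(deg) - A                            # combinatorial Laplacian
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)              # ascending eigenvalues
    # Map back to solutions of L y = lambda * D y and drop the constant solution.
    Y = d_inv_sqrt[:, None] * eigvecs[:, 1:d + 1]
    return Y, L


t = np.linspace(0.0, np.pi, 40)
arc = np.column_stack([np.cos(t), np.sin(t)])
Y, L = laplacian_eigenmaps(arc, k=2, d=1)
```

On this half circle, the single embedding coordinate varies monotonically along the curve, which is the discrete counterpart of the first non-constant eigenfunction described above.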

Diffusion Maps

Diffusion maps leverages the relationship between heat diffusion and a random walk (Markov chain); an analogy is drawn between the diffusion operator on a manifold and a Markov transition matrix operating on functions defined on the graph whose nodes were sampled from the manifold.^[12] This technique bears many similarities to Laplacian eigenmaps. The principal difference is that Laplacian eigenmaps utilizes a sparse similarity matrix, usually computed via k-nearest neighbors or epsilon balls, whereas diffusion maps is typically formulated with a dense kernel matrix over all pairs of points. A minor difference is that diffusion maps essentially computes the spectrum of the normalized graph Laplacian (as is required for the random walk interpretation), while Laplacian eigenmaps can be performed with the combinatorial or normalized Laplacian.
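The random-walk view can be sketched in NumPy as follows. The sketch assumes a dense Gaussian kernel with bandwidth `epsilon` and a number of diffusion steps `t`; both names are illustrative, and the density-normalisation options of the full method are omitted. The Markov matrix P is never diagonalised directly: its eigenvectors are obtained from a symmetric matrix with the same eigenvalues.

```python
import numpy as np


def transition_matrix(X, epsilon):
    """Row-stochastic Markov matrix P built from a dense Gaussian kernel K."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq / epsilon)
    return K / K.sum(axis=1, keepdims=True), K


def diffusion_map(X, d, epsilon, t=1):
    """Return diffusion coordinates after t steps and the eigenvalues used."""
    _, K = transition_matrix(X, epsilon)
    deg = K.sum(axis=1)
    S = K / np.sqrt(np.outer(deg, deg))             # symmetric, same spectrum as P
    eigvals, eigvecs = np.linalg.eigh(S)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    psi = eigvecs / np.sqrt(deg)[:, None]           # right eigenvectors of P
    lam = eigvals[1:d + 1]                          # skip the trivial eigenvalue 1
    return psi[:, 1:d + 1] * lam ** t, lam


t_param = np.linspace(0.0, np.pi, 40)
arc = np.column_stack([np.cos(t_param), np.sin(t_param)])
coords, lam = diffusion_map(arc, d=2, epsilon=0.5)
```

Since P equals the identity minus the random-walk normalized Laplacian, the eigenvectors computed here are also eigenvectors of that Laplacian, which is the connection to Laplacian eigenmaps noted above.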

Hessian LLE

Like LLE, Hessian LLE is also based on sparse matrix techniques. It tends to yield results of a much higher quality than LLE. Unfortunately, it has a very high computational cost, so it is not well-suited for heavily-sampled manifolds. It has no internal model.

Local Tangent Space Alignment

LTSA^[13] is based on the intuition that when a manifold is correctly unfolded, all of the tangent hyperplanes to the manifold will become aligned. It begins by computing the k-nearest neighbors of every point. It computes the tangent space at every point by computing the first d principal components in each local neighborhood. It then optimizes to find an embedding that aligns the tangent spaces.
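The three steps can be sketched in NumPy as below, following the commonly used alignment-matrix formulation. The function name, the choice to include each point in its own neighborhood, and the dense eigen-solver are assumptions of this sketch rather than details given above.

```python
import numpy as np


def ltsa(X, k, d):
    """Return the embedding Y and the alignment matrix B."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    B = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist[i])[:k]              # neighborhood, including the point itself
        Z = X[nbrs] - X[nbrs].mean(axis=0)          # center the local neighborhood
        U, _, _ = np.linalg.svd(Z, full_matrices=False)
        # Columns of U[:, :d] span the local coordinates along the first d principal components.
        G = np.hstack([np.full((k, 1), 1.0 / np.sqrt(k)), U[:, :d]])
        B[np.ix_(nbrs, nbrs)] += np.eye(k) - G @ G.T
    _, eigvecs = np.linalg.eigh(B)                  # ascending eigenvalues
    return eigvecs[:, 1:d + 1], B                   # skip the constant eigenvector


t = np.linspace(0.0, np.pi, 40)
arc = np.column_stack([np.cos(t), np.sin(t)])
Y, B = ltsa(arc, k=5, d=1)
```

Each local term penalises the part of the embedding that cannot be explained by an affine map of the local tangent coordinates, so minimising the total aligns the tangent spaces.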

Maximum Variance Unfolding

Maximum Variance Unfolding (MVU) was formerly known as Semidefinite Embedding. The intuition for this algorithm is that when a manifold is properly unfolded, the variance over the points is maximized. This algorithm also begins by finding the k-nearest neighbors of every point. It then seeks to solve the problem of maximizing the distance between all non-neighboring points, constrained such that the distances between neighboring points are preserved. The primary contribution of this algorithm is a technique for casting this problem as a semidefinite programming problem. Unfortunately, semidefinite programming solvers have a high computational cost. MVU has no internal model. The Landmark-MVU variant of this algorithm uses landmarks to increase speed at some cost to accuracy; it also has no internal model.
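As a sketch of how the problem is usually cast as a semidefinite program, let $K$ be the Gram matrix of the unknown embedded points $\mathbf {y} _{i}$, so that $K_{ij}=\mathbf {y} _{i}^{\mathsf {T}}\mathbf {y} _{j}$. The symbols $K$, $\mathbf {y} _{i}$ and $N$ (the number of points) are notation introduced here for illustration. The program is then commonly written as

 $\max _{K}\ \operatorname {tr} (K)$

subject to

 $K\succeq 0,\qquad \sum _{i,j}K_{ij}=0,\qquad K_{ii}-2K_{ij}+K_{jj}=|\mathbf {x} _{i}-\mathbf {x} _{j}|^{2}\ {\text{for all neighboring pairs }}(i,j).$

The second constraint centers the embedding at the origin, and under it the objective equals the sum of squared pairwise distances up to a constant factor:

 $\operatorname {tr} (K)={\frac {1}{2N}}\sum _{i,j}|\mathbf {y} _{i}-\mathbf {y} _{j}|^{2}.$

The embedding is then read off from the top eigenvectors of the optimal $K$, each scaled by the square root of its eigenvalue.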

Manifold Sculpting

Manifold Sculpting^[14] uses graduated optimization to find an embedding. Like other algorithms, it computes the k-nearest neighbors and seeks an embedding that preserves relationships in local neighborhoods. It slowly scales variance out of higher dimensions, while simultaneously adjusting points in lower dimensions to preserve those relationships. If the rate of scaling is small, it can find very precise embeddings. On several problems it has shown higher empirical accuracy than other algorithms. It can also be used to refine the results from other manifold learning algorithms. It struggles to unfold some manifolds, however, unless a very slow scaling rate is used. It has no internal model.

Local Multidimensional Scaling

Local Multidimensional Scaling^[15] performs multidimensional scaling in local regions, and then uses convex optimization to fit all the pieces together.

Autoencoders

A completely different approach to nonlinear dimensionality reduction is through the use of autoencoders, a special kind of feed-forward neural network. Although the idea of autoencoders is quite old, training of the encoders has only recently become possible through the use of restricted Boltzmann machines. Related to autoencoders is the NeuroScale algorithm, which uses stress functions inspired by multidimensional scaling and Sammon mappings (see below) to learn a non-linear mapping from the high-dimensional space to the embedded space. The mappings in NeuroScale are based on radial basis function networks.

Curvilinear component analysis

Curvilinear component analysis (CCA)^[16] looks for the configuration of points in the output space that preserves original distances as much as possible while focusing on small distances in the output space (in contrast to Sammon's mapping, which focuses on small distances in the original space).
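The difference in emphasis can be made concrete by comparing the two cost functions. In the NumPy sketch below, Sammon's stress divides each squared error by the original distance, while the CCA-style cost weights each squared error by a decreasing function of the output distance. The step-function weight with threshold `lam`, the function names, and the sum over unordered pairs are assumptions chosen for illustration.

```python
import numpy as np


def pair_distances(X):
    """Distances between all unordered pairs of rows, as a flat vector."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return D[np.triu_indices(X.shape[0], k=1)]


def sammon_stress(X, Y):
    """Errors are weighted by 1 / (distance in the ORIGINAL space)."""
    dx, dy = pair_distances(X), pair_distances(Y)
    return ((dx - dy) ** 2 / dx).sum() / dx.sum()


def cca_cost(X, Y, lam):
    """Errors count only for pairs that are close in the OUTPUT space."""
    dx, dy = pair_distances(X), pair_distances(Y)
    weight = (dy <= lam).astype(float)              # step function of the output distance
    return 0.5 * ((dx - dy) ** 2 * weight).sum()


X = np.array([[0.0], [1.0], [3.0]])                 # original points
Y = np.array([[0.0], [1.0], [2.0]])                 # candidate embedding
print(sammon_stress(X, Y), cca_cost(X, Y, lam=1.5))
```

In this example the pair that is far apart in the output space contributes to Sammon's stress but is ignored by the CCA-style cost, which is what lets CCA tolerate tears while unfolding a manifold.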

Data-Driven High Dimensional Scaling

Data-Driven High Dimensional Scaling (DD-HDS)^[17] is closely related to Sammon's mapping and curvilinear component analysis, except that (1) it simultaneously penalizes false neighborhoods and tears by focusing on small distances in both the original and the output space, and (2) it accounts for the concentration of measure phenomenon by using a weighting function adapted to the distance distribution.

RankVisu

RankVisu^[18] is designed to preserve the rank of neighborhoods rather than distances. RankVisu is especially useful on difficult tasks (when the preservation of distance cannot be achieved satisfactorily). Indeed, the rank of neighborhood is less informative than distance (ranks can be deduced from distances but distances cannot be deduced from ranks) and its preservation is thus easier.
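The remark that ranks can be deduced from distances, but not the reverse, can be illustrated with a short NumPy sketch. It only computes neighborhood ranks from a distance matrix; it is not an implementation of RankVisu, and the function name is hypothetical.

```python
import numpy as np


def neighbor_ranks(D):
    """ranks[i, j] is the position of point j in point i's list of nearest neighbors."""
    order = np.argsort(D, axis=1, kind="stable")    # neighbors of each point, nearest first
    ranks = np.empty_like(order)
    rows = np.arange(D.shape[0])[:, None]
    ranks[rows, order] = np.arange(D.shape[1])[None, :]
    return ranks


def distance_matrix(X):
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)


# Two configurations with very different distances but identical neighborhood ranks.
A = np.array([[0.0], [1.0], [3.0]])
B = np.array([[0.0], [1.0], [10.0]])
print(neighbor_ranks(distance_matrix(A)))
print(neighbor_ranks(distance_matrix(B)))
```

Both calls print the same rank matrix, so an embedding that preserves ranks has far more freedom than one that must preserve the distances themselves.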

Methods based on proximity matrices

A method based on proximity matrices is one where the data is presented to the algorithm in the form of a similarity matrix or a distance matrix. These methods all fall under the broader class of metric multidimensional scaling. The variations tend to be differences in how the proximity data is computed; for example, Isomap, locally linear embeddings, maximum variance unfolding, and Sammon mapping (which is not in fact a mapping) are examples of metric multidimensional scaling methods.

References

^ A. Gorban, B. Kegl, D. Wunsch, A. Zinovyev (Eds.), Principal Manifolds for Data Visualisation and Dimension Reduction, LNCSE 58, Springer, Berlin – Heidelberg – New York, 2007. ISBN 978-3-540-73749-0
^ B. Schölkopf, A. Smola, K.-R. Muller, Kernel Principal Component Analysis, In: Bernhard Schölkopf, Christopher J. C. Burges, Alexander J. Smola (Eds.), Advances in Kernel Methods-Support Vector Learning, 1999, MIT Press Cambridge, MA, USA, 327–352. ISBN 0-262-19416-3
^ The Gaussian Processes Web Site
^ P. Demartines and J. Hérault, Curvilinear Component Analysis: A Self-Organizing Neural Network for Nonlinear Mapping of Data Sets, IEEE Transactions on Neural Networks, Vol. 8(1), 1997, p. 148-154
^ J. B. Tenenbaum, V. de Silva, J. C. Langford, A Global Geometric Framework for Nonlinear Dimensionality Reduction, Science 290, (2000), 2319–2323.
^ John Aldo Lee and Amaury Lendasse and Michel Verleysen, Curvilinear distance analysis versus isomap, Proceedings of ESANN’2002, 10th European Symposium on Artificial Neural Networks, 2002, p. 185-192
^ S. T. Roweis and L. K. Saul, Nonlinear Dimensionality Reduction by Locally Linear Embedding, Science Vol 290, 22 December 2000, 2323–2326.
^ Mikhail Belkin and Partha Niyogi, Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering, Advances in Neural Information Processing Systems 14, 2001, p. 586-691, MIT Press
^ Mikhail Belkin Problems of Learning on Manifolds, PhD Thesis, Department of Mathematics, The University Of Chicago, August 2003
^ Bengio et al. "Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering" in Advances in Neural Information Processing Systems (2004)
^ Mikhail Belkin Problems of Learning on Manifolds, PhD Thesis, Department of Mathematics, The University Of Chicago, August 2003
^ Diffusion Maps and Geometric Harmonics, Stephane Lafon, PhD Thesis, Yale University, May 2004
^ Zhenyue Zhang and Hongyuan Zha, Principal Manifolds and Nonlinear Dimension Reduction via Local Tangent Space Alignment, SIAM Journal on Scientific Computing 26 (1) (2005), 313–338.
^ Gashler, M. and Ventura, D. and Martinez, T., Iterative Non-linear Dimensionality Reduction with Manifold Sculpting, In Platt, J.C. and Koller, D. and Singer, Y. and Roweis, S., editor, Advances in Neural Information Processing Systems 20, pp. 513-520, MIT Press, Cambridge, MA, 2008
^ J Venna and S Kaski, Local multidimensional scaling, Neural Networks, 2006
^ P. Demartines and J. Hérault, Curvilinear Component Analysis: A Self-Organizing Neural Network for Nonlinear Mapping of Data Sets, IEEE Transactions on Neural Networks, Vol. 8(1), 1997, p. 148-154
^ S. Lespinats, M. Verleysen, A. Giron, B. Fertil, DD-HDS: a tool for visualization and exploration of high dimensional data, IEEE Transactions on Neural Networks 18 (5) (2007) 1265–1279.
^ Lespinats S., Fertil B., Villemain P. and Herault J., Rankvisu: Mapping from the neighbourhood network, Neurocomputing, vol. 72 (13-15), pp. 2964-2978, 2009.
^ ELastic MAPs

External links

Isomap
Generative Topographic Mapping
Mike Tipping's Thesis
Gaussian Process Latent Variable Model
Locally Linear Embedding
Relational Perspective Map
Waffles is an open source C++ library containing implementations of LLE, Manifold Sculpting, and some other manifold learning algorithms.
DD-HDS homepage
RankVisu homepage

@@ Line 133: / Line 133: @@
 === RankVisu ===
-RankVisu <ref>Lespinats S., Fertil B., Villemain P. and Herault J., Rankvisu: Mapping from the neighbourhood network, Neurocomputing, vol. 72 (13-15), pp. 2964-2978, 2009.</ref> is designed to preserve rank of neighborhoods rather than distance. RankVisu seems is especially useful on difficult tasks (when the preservation of distance cannot be achieved satisfyingly). Indeed, the rank of neighborhood is less informative than distance (ranks can be deduced from distances but distances cannot be deduced from ranks) and its preservation is thus easier.
+RankVisu <ref>Lespinats S., Fertil B., Villemain P. and Herault J., Rankvisu: Mapping from the neighbourhood network, Neurocomputing, vol. 72 (13-15), pp. 2964-2978, 2009.</ref> is designed to preserve rank of neighborhoods rather than distance. RankVisu is especially useful on difficult tasks (when the preservation of distance cannot be achieved satisfyingly). Indeed, the rank of neighborhood is less informative than distance (ranks can be deduced from distances but distances cannot be deduced from ranks) and its preservation is thus easier.
 ==Methods based on proximity matrices==