Dimension Reduction and Visualization of Large High-Dimensional Data via Interpolation
Seung-Hee Bae, Jong Youl Choi, Judy Qiu, and Geoffrey Fox
School of Informatics and Computing, Pervasive Technology Institute, Indiana University
SALSA project: http://salsahpc.indiana.edu
Outline
- Introduction to Point Data Visualization
- Review of Dimension Reduction Algorithms
  - Multidimensional Scaling (MDS)
  - Generative Topographic Mapping (GTM)
- Challenges
- Interpolation
  - MDS Interpolation
  - GTM Interpolation
- Experimental Results
- Conclusion
Point Data Visualization
- Visualize high-dimensional data as points in 2D or 3D by dimension reduction.
- Distances in the target dimension should approximate the distances in the original high-dimensional space.
- Users can interactively browse the data, and clusters or groups are easy to recognize.
- Example: chemical data (PubChem).
- Visualization can display disease-gene relationships, aiming at finding cause-effect relationships between diseases and genes.
Multi-Dimensional Scaling (MDS)
- Input: an N-by-N pairwise dissimilarity matrix Δ; each element can be a distance, a score, a rank, etc.
- Given Δ, find a mapping of the points in the target dimension.
- Criteria (objective functions): STRESS and SSTRESS, defined below.
- SMACOF is one of the algorithms for solving the MDS problem.
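The slide's formula images did not survive extraction; for reference, these are the standard definitions of the two objectives, with weights $w_{ij}$, target-space distances $d_{ij}(X)$, and given dissimilarities $\delta_{ij}$:

$$\sigma(X) = \sum_{i<j \le N} w_{ij}\,\bigl(d_{ij}(X) - \delta_{ij}\bigr)^2 \qquad \text{(STRESS)}$$

$$\sigma^2(X) = \sum_{i<j \le N} w_{ij}\,\bigl(d_{ij}^2(X) - \delta_{ij}^2\bigr)^2 \qquad \text{(SSTRESS)}$$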
Generative Topographic Mapping (GTM)
- K latent points, N data points; the input is a set of high-dimensional vectors.
- Latent Variable Model (LVM):
  - Define K latent variables z_k.
  - Map the K latent points to the data space through a non-linear function f (fitted by an EM approach).
  - Construct a map of the data points in the latent space based on a Gaussian Mixture Model.
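For reference (textbook GTM, not recovered from the slide): the K mapped latent points act as the centers of an equal-weight isotropic Gaussian mixture in the D-dimensional data space, with inverse variance $\beta$, and training maximizes the log-likelihood of the N data points under

$$p(\mathbf{x} \mid W, \beta) = \frac{1}{K} \sum_{k=1}^{K} \left(\frac{\beta}{2\pi}\right)^{D/2} \exp\!\left(-\frac{\beta}{2}\,\bigl\lVert \mathbf{x} - f(\mathbf{z}_k; W) \bigr\rVert^2\right).$$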
GTM vs. MDS
Both methods perform non-linear dimension reduction, find an optimal configuration in a lower dimension, and use an iterative optimization method. They differ as follows:

| | GTM | MDS (SMACOF) |
|---|---|---|
| Objective function | Maximize log-likelihood | Minimize STRESS or SSTRESS |
| Complexity | O(KN) (K << N) | O(N²) |
| Optimization method | EM | Iterative majorization (EM-like) |
| Input format | Vector representation | Pairwise distance as well as vector |
Challenges
- Data is getting larger and higher-dimensional:
  - PubChem: a database of 60M chemical compounds.
  - Our initial results on 100K sequences need to be extended to millions of sequences.
  - Typical dimension: 150-1000.
- The MDS results shown were computed on a 768-core (32x24) cluster with 1.54 TB of memory.
- Interpolation reduces the computational complexity: O(N²) → O(n² + (N−n)·n).
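As a rough check at the paper's scale (N = 2M total points, n = 100k samples): $N^2 = 4\times10^{12}$ pairwise terms, while $n^2 + (N-n)\,n \approx 10^{10} + 1.9\times10^{11} \approx 2\times10^{11}$, about a 20x reduction in distance computations; only the n-by-n (rather than N-by-N) distance matrix must be held in memory.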
Interpolation Approach
Two-step procedure (sketched in code below):
1. Training: a dimension reduction algorithm constructs a mapping of n sample data points (out of the total N) in the target dimension.
2. Interpolation: the remaining (N−n) out-of-sample points are mapped into the target dimension with respect to the constructed mapping of the n sample points, without moving the sample mappings.
[Figure: of the total N data points, the n in-sample points are trained (e.g., via MPI) to form the trained mapping; the N−n out-of-sample points are split into chunks 1, 2, ..., P−1, P and interpolated in parallel (e.g., via MapReduce) to produce the interpolated map.]
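A minimal Python sketch of the two-step flow; the names two_step_mapping, fit_sample, and interpolate_one are hypothetical stand-ins for the trained algorithm and the per-point interpolation:

```python
import numpy as np

def two_step_mapping(data, n, fit_sample, interpolate_one):
    """data: (N, D) array. fit_sample runs the full algorithm (MDS or GTM)
    on the n in-sample points; interpolate_one places one out-of-sample
    point against the fixed sample mapping."""
    in_sample, out_of_sample = data[:n], data[n:]
    sample_map = fit_sample(in_sample)       # step 1: train on the n samples only
    # Step 2: each out-of-sample point depends only on the fixed sample
    # mapping, so this loop can be split into chunks for MapReduce or MPI.
    rest_map = np.array([interpolate_one(x, in_sample, sample_map)
                         for x in out_of_sample])
    return np.vstack([sample_map, rest_map])
```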
MDS Interpolation
- Assume the mappings of the n sampled data points in the target dimension are given (the result of normal MDS).
  - These act as landmark points and do not move during interpolation.
- The (N−n) out-of-sample points are interpolated based on the mappings of the n sample points (see the sketch below):
  - Find the k nearest neighbors (k-NN) of the new point among the n sample points.
  - Based on the mappings of those k-NN, find a position for the new point by the proposed iterative majorization approach.
- Computational complexity: O(Mn), where M = N−n.
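A minimal sketch of the per-point placement in Python: a SMACOF-style majorization update against fixed landmarks, consistent with the slide's description; the choice of k, iteration count, and stopping rule here are illustrative assumptions:

```python
import numpy as np

def interpolate_point_mds(x, sample, sample_map, k=5, iters=50, eps=1e-6):
    """Place one out-of-sample point x against the fixed mappings of the
    n sample points. sample: (n, D) original-space points;
    sample_map: (n, d) their target-dimension mappings."""
    # Step 1: k-NN of x among the sample points, using original-space distances.
    delta = np.linalg.norm(sample - x, axis=1)
    nn = np.argsort(delta)[:k]
    P, d_orig = sample_map[nn], delta[nn]
    z = P.mean(axis=0)                       # start from the k-NN centroid
    # Step 2: iterative majorization with the k landmark mappings held fixed.
    for _ in range(iters):
        d_low = np.maximum(np.linalg.norm(P - z, axis=1), eps)
        z_new = P.mean(axis=0) + ((d_orig / d_low)[:, None] * (z - P)).mean(axis=0)
        if np.linalg.norm(z_new - z) < eps:  # converged
            break
        z = z_new
    return z
```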
GTM Interpolation
- Assume the positions of the K latent points, trained on the sample data, are given in the latent space.
  - This training is the most time-consuming part of GTM.
- The (N−n) out-of-sample points are positioned directly with respect to the Gaussian Mixture Model between each new point and the given positions of the K latent points (see the sketch below).
- Computational complexity: O(M), where M = N−n.
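A sketch of the direct placement: compute the new point's responsibilities against the K trained mixture centers and take the posterior mean over the latent grid. This is the standard GTM projection, O(K) per point and hence O(M) overall; the function name and argument layout are illustrative:

```python
import numpy as np

def interpolate_point_gtm(x, latent, centers, beta):
    """Place one out-of-sample point x in the latent space.
    latent: (K, L) latent grid points z_k; centers: (K, D) trained
    images f(z_k; W) in data space; beta: trained inverse variance."""
    # Responsibilities r_k = p(z_k | x) under the equal-weight Gaussian mixture.
    log_r = -0.5 * beta * np.sum((centers - x) ** 2, axis=1)
    r = np.exp(log_r - log_r.max())          # subtract the max for numerical stability
    r /= r.sum()
    # The posterior mean over the latent grid gives the point's position.
    return r @ latent
```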
Experiment Environments
Quality Comparison (1)
[Figures: GTM interpolation quality comparison w.r.t. different sample sizes, N = 100k; MDS interpolation quality comparison w.r.t. different sample sizes, N = 100k.]
Quality Comparison (2)
[Figures: GTM interpolation quality up to 2M points; MDS interpolation quality up to 2M points.]
Parallel Efficiency
[Figures: MDS parallel efficiency on Cluster-II; GTM parallel efficiency on Cluster-II.]
GTM Interpolation via MapReduce
[Figures: GTM interpolation time per core to process 100k data points per core; GTM interpolation parallel efficiency; 26.4 million PubChem data points.]
Platforms: DryadLINQ using a 16-core machine with 16 GB, Hadoop on 8 cores with 48 GB, and Azure small instances with 1 core and 1.7 GB.
Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, and Geoffrey Fox, "Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications," in Proceedings of the ECMLS Workshop of ACM HPDC 2010.
MDS Interpolation via MapReduce
[Figures: DryadLINQ on a cluster of 32 nodes x 24 cores with 48 GB per node; Azure using small instances.]
Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, and Geoffrey Fox, "Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications," in Proceedings of the ECMLS Workshop of ACM HPDC 2010.
MDS Interpolation Map
[Figure: PubChem data visualization using MDS (100k) and interpolation (100k + 100k).]
GTM Interpolation Map
[Figure: PubChem data visualization using GTM (100k) and interpolation (2M + 100k).]
Conclusion
- Dimension reduction algorithms such as GTM and MDS are computation- and memory-intensive applications.
- We apply the interpolation (out-of-sample) approach to GTM and MDS in order to process and visualize large, high-dimensional datasets.
- Millions of data points can be processed via interpolation.
- The approach can be parallelized in MapReduce fashion as well as MPI fashion.
Future Work
- Make the tools available as a service.
- Hierarchical interpolation could reduce the computational complexity from O(Mn) to O(M log n), as sketched below.
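The O(M log n) figure follows if each out-of-sample point's neighbor search over the n samples uses a hierarchical spatial index instead of a linear scan. A sketch using SciPy's cKDTree; the choice of index is our assumption, not the paper's method, and kd-trees degrade for very high dimensions:

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_via_tree(sample, out_of_sample, k=5):
    """sample: (n, D) in-sample points; out_of_sample: (M, D) queries."""
    tree = cKDTree(sample)                  # built once: O(n log n)
    # Each query costs roughly O(log n) instead of O(n),
    # so M queries cost O(M log n) overall.
    dists, idx = tree.query(out_of_sample, k=k)
    return dists, idx
```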
Acknowledgment
Our internal collaborators in the School of Informatics and Computing at IUB:
- Prof. David Wild
- Dr. Qian Zhu
Thank you. Questions? Email me at sebae@cs.indiana.edu
EM Optimization
- Find K centers for the N data points: a K-clustering problem, known to be NP-hard.
- Use the Expectation-Maximization (EM) method.
- The EM algorithm finds a locally optimal solution iteratively until convergence, alternating an E-step and an M-step (standard forms are given below, since the slide's equations did not survive extraction).
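For reference, the textbook GTM E- and M-steps; these are standard forms, not recovered from the slide. With $f(\mathbf{z}_k; W) = W\phi(\mathbf{z}_k)$, basis matrix $\Phi$, responsibility matrix $R = [r_{kn}]$, and $G = \mathrm{diag}\bigl(\sum_n r_{kn}\bigr)$:

E-step (responsibilities):
$$r_{kn} = \frac{\exp\bigl(-\tfrac{\beta}{2}\lVert \mathbf{x}_n - f(\mathbf{z}_k; W)\rVert^2\bigr)}{\sum_{k'=1}^{K}\exp\bigl(-\tfrac{\beta}{2}\lVert \mathbf{x}_n - f(\mathbf{z}_{k'}; W)\rVert^2\bigr)}$$

M-step (update $W$ and $\beta$):
$$\Phi^{\mathsf T} G \Phi\, W_{\mathrm{new}}^{\mathsf T} = \Phi^{\mathsf T} R X, \qquad \frac{1}{\beta_{\mathrm{new}}} = \frac{1}{ND} \sum_{k=1}^{K}\sum_{n=1}^{N} r_{kn}\,\bigl\lVert \mathbf{x}_n - f(\mathbf{z}_k; W_{\mathrm{new}})\bigr\rVert^2$$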
Parallelization
- Interpolation is a pleasingly parallel application: the out-of-sample data points are independent of each other.
- We can therefore parallelize the interpolation application in MapReduce fashion as well as MPI fashion (see the sketch below).
Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, and Geoffrey Fox, "Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications," in Proceedings of the ECMLS Workshop of ACM HPDC 2010.
[Figure: of the total N data points, the n in-sample points are trained; the N−n out-of-sample points are split into chunks 1, 2, ..., P−1, P and interpolated in parallel.]
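A minimal map-style illustration using Python's multiprocessing, reusing the interpolate_point_mds sketch above; this is our illustration of the pleasingly parallel structure, while the paper's actual runs used Hadoop, DryadLINQ, and MPI:

```python
from functools import partial
from multiprocessing import Pool
import numpy as np

def parallel_interpolation(out_of_sample, sample, sample_map, workers=8):
    """Interpolate all out-of-sample points in parallel. Every task reads
    the same fixed sample mapping and writes one independent result."""
    place = partial(interpolate_point_mds, sample=sample, sample_map=sample_map)
    with Pool(workers) as pool:             # map stage: no inter-task communication
        return np.array(pool.map(place, list(out_of_sample)))
```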

Editor's Notes

  • #4 Microarray data.
  • #15 EC2 HM4XL: high-memory instances (8 x 3.25 GHz cores, 68 GB memory). EC2 HCXL: high-CPU extra large (8 x 2.4 GHz, 7 GB memory). EC2 Large: 2 x 2.4 GHz, 7.5 GB memory. GTM interpolation is memory-bound; hence fewer cores per unit of memory (less memory and memory-bandwidth contention) is an advantage. DryadLINQ efficiency suffers due to the 16-core, 16 GB machines.
  • #16 The efficiency drop in MDS DryadLINQ at the last point is due to an unbalanced partition (2,600 blocks onto 768 cores).