Personal code for principal component analysis and diffusion map examples. Specifically made to test the idea on some well-known types of data, but it wouldn't take much to modify the source for use with whatever data set or distance metric you desire.
$ make
This compiles a library containing the classes needed by the main program, then builds the main program and links it against that library. The main program requires json-fortran. The library requires LAPACK to calculate the eigenvectors and eigenvalues of various matrices.
Modify dmap.json, then run:
$ ./run dmap.json
You can also run principal component analysis using the following file:
$ ./run pca.json
bandwidth.json is for running the program iteratively over different bandwidth values. See Figure S1 in this document for what I was going for with this. This would be more helpful for analyzing simulation data, but the main program is not set up for that.
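As a rough illustration of what a bandwidth scan can look like, here is a minimal Python sketch (not the Fortran code in this repository). It sums a Gaussian kernel over all pairs of points for each trial bandwidth, which is one common quantity to inspect when picking a bandwidth; whether that is exactly what Figure S1 plots is an assumption. The function names `kernel_sum` and `scan_bandwidths`, the toy data, and the factor of 2 in the kernel denominator are all hypothetical choices made for the example.

```python
import math

def kernel_sum(dist, bandwidth):
    """Sum exp(-d^2 / (2 * bandwidth)) over every entry of a square distance matrix.
    The factor of 2 is an assumption; the bandwidth itself is not squared."""
    return sum(math.exp(-d * d / (2.0 * bandwidth)) for row in dist for d in row)

def scan_bandwidths(dist, bandwidths):
    """Return (log bandwidth, log kernel sum) pairs, one per trial bandwidth."""
    return [(math.log(b), math.log(kernel_sum(dist, b))) for b in bandwidths]

# Toy data: four points on a line; distances are absolute differences.
points = [0.0, 1.0, 2.0, 4.0]
dist = [[abs(a - b) for b in points] for a in points]

# Trial bandwidths spaced by powers of ten.
bandwidths = [10.0 ** k for k in range(-3, 4)]
for log_b, log_s in scan_bandwidths(dist, bandwidths):
    print(f"{log_b:8.3f} {log_s:8.3f}")
```

For very small bandwidths only the diagonal contributes, so the sum approaches the number of points; for very large bandwidths every entry approaches 1, so the sum approaches the number of points squared. The useful bandwidths usually lie in the transition between those two plateaus.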
The extras folder contains the source code of two programs to aid in generating example data sets. No configuration files are provided, so you will need to edit the source.
A few examples using this program.
Compare the swiss roll and punctured sphere results with those found in this paper, specifically in Section 3.1. Note that my value of bandwidth is the square of what they call sigma (I am not squaring the denominator of the Gaussian kernel in my code).
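The relationship between the two conventions can be checked with a short Python sketch (again, not the Fortran code itself). The function names `kernel_sigma` and `kernel_bandwidth` are hypothetical, and the factor of 2 in the denominator is an assumption; the point is only that setting bandwidth equal to sigma squared makes the two forms give the same kernel value.

```python
import math

def kernel_sigma(d, sigma):
    """Gaussian kernel with the width squared in the denominator."""
    return math.exp(-d * d / (2.0 * sigma * sigma))

def kernel_bandwidth(d, bandwidth):
    """Gaussian kernel where the width in the denominator is not squared."""
    return math.exp(-d * d / (2.0 * bandwidth))

sigma = 1.5
bandwidth = sigma ** 2  # convert a sigma value into the equivalent bandwidth

# Both forms agree at every distance once bandwidth = sigma^2.
for d in (0.0, 0.5, 1.0, 2.0):
    assert math.isclose(kernel_sigma(d, sigma), kernel_bandwidth(d, bandwidth))
```

So to reproduce a result quoted with a given sigma, square that value before using it as the bandwidth.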
Colors indicate where points lie along the axis with the greatest variance.
Colors indicate original cluster.
Colors indicate where points lie relative to the center of the swiss roll.
Colors indicate where points lie along the axis that goes through the holes in the sphere.
The original data is from a molecular dynamics simulation I performed of a single octane in water. I used the RMSD between each pair of simulation snapshots of the octane as the distance metric for the diffusion map calculation (1,000 snapshots total). For the principal component analysis I used the dihedral angles as the metric. The colors indicate the radius of gyration of the octane. Compare these results with Figure S2.C from this paper's SI (PDF).
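For readers unfamiliar with the two quantities mentioned above, here is a minimal Python sketch of a pairwise RMSD matrix and a radius of gyration. It is not the code from this repository. It assumes the frames have already been fitted to a common reference, so no per-pair alignment is done, and it treats all atoms as having equal mass; both are simplifying assumptions, and the function names are hypothetical.

```python
import math

def rmsd(frame_a, frame_b):
    """RMSD between two frames, each a list of (x, y, z) tuples.
    No fitting is performed; frames are assumed to be aligned already."""
    total = sum((a - b) ** 2
                for pa, pb in zip(frame_a, frame_b)
                for a, b in zip(pa, pb))
    return math.sqrt(total / len(frame_a))

def pairwise_rmsd(frames):
    """Symmetric matrix of RMSD values between every pair of frames."""
    n = len(frames)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = rmsd(frames[i], frames[j])
    return dist

def radius_of_gyration(frame):
    """Radius of gyration with equal weights for all atoms (an assumption)."""
    n = len(frame)
    center = [sum(p[k] for p in frame) / n for k in range(3)]
    sq = sum((p[k] - center[k]) ** 2 for p in frame for k in range(3))
    return math.sqrt(sq / n)

# Toy example: two frames of a two-atom "molecule".
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
]
dist = pairwise_rmsd(frames)
rg = [radius_of_gyration(f) for f in frames]
```

The matrix `dist` plays the role of the distance metric for the diffusion map, while `rg` is only used afterwards to color the points.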
The alkane branch has the modified code that performs these calculations. The original simulation trajectory is too large to post here. To reproduce the data, use this input file with GROMACS and run the simulation. Then use gmx trjconv to fit the octane's translational and rotational motion, saving only the octane's coordinates. Use the output coordinate file (xtc) as the input for this analysis. By default the simulation will output 10,000 frames, so you may want to reduce this for the diffusion map analysis, since it is very memory intensive.
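To get a feel for why the frame count matters, the following Python sketch estimates the size of one dense N x N matrix and shows the effect of keeping every tenth frame. It assumes 8-byte (double precision) values and a single matrix; the actual program may hold several matrices and work arrays at once, so treat the numbers as a lower bound. The function names are hypothetical.

```python
def matrix_bytes(n_frames, bytes_per_value=8):
    """Bytes needed for one dense n_frames x n_frames matrix (8-byte values assumed)."""
    return n_frames * n_frames * bytes_per_value

def subsample(indices, stride):
    """Keep every stride-th frame index."""
    return indices[::stride]

full = 10_000                     # default number of frames from the simulation
stride = 10                       # keep every 10th frame
kept = subsample(list(range(full)), stride)

print(len(kept))                                  # number of frames kept
print(matrix_bytes(full) / 1e6, "MB")             # one matrix for all frames
print(matrix_bytes(len(kept)) / 1e6, "MB")        # one matrix after subsampling
```

Because the matrix size grows with the square of the number of frames, reducing 10,000 frames to 1,000 cuts the size of each matrix by a factor of 100. The thinning itself can be done when writing the trajectory, for example with the -skip option of gmx trjconv.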