|
374 | 374 | "cell_type": "markdown", |
375 | 375 | "metadata": {}, |
376 | 376 | "source": [ |
377 | | - "With this in place, we can take a look at what the GMM model gives us for our initial data:" |
| 377 | + "With this in place, we can take a look at what the four-component GMM gives us for our initial data:" |
378 | 378 | ] |
379 | 379 | }, |
380 | 380 | { |
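The plotting cell that follows is not part of this hunk. As a rough sketch of the step the text describes, fitting a four-component Gaussian mixture and coloring points by their predicted component, where the blob parameters and seeds are placeholder assumptions rather than the notebook's own cell:

```python
# Sketch only: fit a four-component GMM to blob-like data and color points
# by their predicted component. Dataset parameters and seeds are assumptions.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture  # the "GMM" the text refers to

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

gmm = GaussianMixture(n_components=4, random_state=42).fit(X)
labels = gmm.predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis')
plt.show()
```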
|
561 | 561 | "metadata": {}, |
562 | 562 | "source": [ |
563 | 563 | "Here the mixture of 16 Gaussians serves not to find separated clusters of data, but rather to model the overall *distribution* of the input data.\n", |
564 | | - "This is a generative model of the distribution, meaning that the GMM model gives us the recipe to generate new random data distributed similarly to our input.\n", |
565 | | - "For example, here are 400 new points drawn from this 16-component GMM model to our original data:" |
| 564 | + "This is a generative model of the distribution, meaning that the GMM gives us the recipe to generate new random data distributed similarly to our input.\n", |
| 565 | + "For example, here are 400 new points drawn from this 16-component GMM fit to our original data:" |
566 | 566 | ] |
567 | 567 | }, |
568 | 568 | { |
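The sampling cell itself sits outside this hunk; a minimal sketch of the idea, with make_moons standing in for the notebook's Xmoon array (an assumption):

```python
# Sketch only: model the density of two-moons data with a 16-component GMM,
# then draw 400 new points from the fitted mixture. Dataset parameters are
# assumptions standing in for the notebook's Xmoon.
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

Xmoon, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

gmm16 = GaussianMixture(n_components=16, covariance_type='full', random_state=0)
gmm16.fit(Xmoon)

Xnew, _ = gmm16.sample(400)  # 400 generated points distributed like Xmoon
```

Because `sample` draws from the fitted mixture density, the new points follow the overall shape of the input distribution rather than any single cluster.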
|
604 | 604 | "The fact that GMM is a generative model gives us a natural means of determining the optimal number of components for a given dataset.\n", |
605 | 605 | "A generative model is inherently a probability distribution for the dataset, and so we can simply evaluate the *likelihood* of the data under the model, using cross-validation to avoid over-fitting.\n", |
606 | 606 | "Another means of correcting for over-fitting is to adjust the model likelihoods using some analytic criterion such as the [Akaike information criterion (AIC)](https://en.wikipedia.org/wiki/Akaike_information_criterion) or the [Bayesian information criterion (BIC)](https://en.wikipedia.org/wiki/Bayesian_information_criterion).\n", |
607 | | - "Scikit-Learn's ``GMM`` model actually includes built-in methods that compute both of these, and so it is very easy to operate on this approach.\n", |
| 607 | + "Scikit-Learn's ``GMM`` estimator actually includes built-in methods that compute both of these, and so it is very easy to operate on this approach.\n", |
608 | 608 | "\n", |
609 | 609 | "Let's look at the AIC and BIC as a function as the number of GMM components for our moon dataset:" |
610 | 610 | ] |
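The sweep the text describes would look roughly like this sketch (not the notebook's cell; the component range and the make_moons stand-in for Xmoon are assumptions):

```python
# Sketch only: fit GMMs with 1-20 components and compare AIC and BIC.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

Xmoon, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

n_components = np.arange(1, 21)
models = [GaussianMixture(n, covariance_type='full', random_state=0).fit(Xmoon)
          for n in n_components]

plt.plot(n_components, [m.bic(Xmoon) for m in models], label='BIC')
plt.plot(n_components, [m.aic(Xmoon) for m in models], label='AIC')
plt.legend(loc='best')
plt.xlabel('n_components')
plt.show()
```

The preferred component count is the one that minimizes the criterion; AIC and BIC both penalize model complexity, BIC more strongly.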
|
725 | 725 | "cell_type": "markdown", |
726 | 726 | "metadata": {}, |
727 | 727 | "source": [ |
728 | | - "We have nearly 1,800 digits in 64 dimensions, and we can build a GMM model on top of these to generate more.\n", |
729 | | - "GMM can have difficulty converging in such a high dimensional space, so we will start with an invertible dimensionality reduction algorithm on the data.\n", |
| 728 | + "We have nearly 1,800 digits in 64 dimensions, and we can build a GMM on top of these to generate more.\n", |
| 729 | + "GMMs can have difficulty converging in such a high dimensional space, so we will start with an invertible dimensionality reduction algorithm on the data.\n", |
730 | 730 | "Here we will use a straightforward PCA, asking it to preserve 99% of the variance in the projected data:" |
731 | 731 | ] |
732 | 732 | }, |
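As a sketch of that projection step (not the notebook's exact cell; `whiten=True` is an assumption), PCA with a float `n_components` keeps just enough components to reach the requested variance fraction, and `inverse_transform` can later map samples generated in the reduced space back to 64-dimensional digit space:

```python
# Sketch only: reduce the ~1,800 x 64 digits matrix to however many PCA
# components are needed to retain 99% of the variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
print(digits.data.shape)      # (1797, 64)

pca = PCA(n_components=0.99, whiten=True)
data = pca.fit_transform(digits.data)
print(data.shape)             # far fewer than 64 dimensions retained
```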
|