ENH: PCA via Array API based on #2096 #2106
Conversation
added array-api-compat to test env
added versioning for get_namespace
```python
# set attributes necessary for calls to transform, will modify
# self._onedal_estimator, and clear any previous fit models
# Follow guidance from sklearn PCA._fit_full and copy the data
self.n_components_ = n_components
```
@icfaust What happens if the number of non-zero singular values is less than all of: (a) the user-supplied n_components, (b) the number of rows in the data, and (c) the number of columns in the data?
Could you please add a test with lots of duplicated or linearly dependent columns, up to the point where such a situation is encountered?
Dropping an example here in case it isn't very clear what kind of inputs would trigger this:
```python
import numpy as np

rng = np.random.default_rng(seed=123)
m = 4
n = 6
mu = rng.standard_normal(size=n)
S = rng.standard_normal(size=(n + 1, n))
S = S.T @ S
X = rng.multivariate_normal(mu, S, size=m)
X[:, 3:] = X[:, :3]
```

This data has 4 rows, 6 columns, and 3 non-zero singular values.

In this case, sklearn doesn't drop the last singular value even though it should, but at least it doesn't end up crashing or anything like that.
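To make the rank deficiency concrete, here is a small NumPy check (reusing the construction above) that counts the non-zero singular values of the centered data, which is what PCA actually decomposes:

```python
import numpy as np

rng = np.random.default_rng(seed=123)
m, n = 4, 6
mu = rng.standard_normal(size=n)
S = rng.standard_normal(size=(n + 1, n))
S = S.T @ S
X = rng.multivariate_normal(mu, S, size=m)
X[:, 3:] = X[:, :3]  # duplicate the first 3 columns -> column rank at most 3

# PCA decomposes the centered data; its rank is limited both by the
# duplicated columns (<= 3) and by centering (<= m - 1 = 3)
Xc = X - X.mean(axis=0)
sv = np.linalg.svd(Xc, compute_uv=False)
n_nonzero = int(np.sum(sv > 1e-10 * sv[0]))
print(n_nonzero)  # -> 3
```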
/intelci: run
```python
assert not np.array_equal(_as_numpy(est.transform(X)), _as_numpy(est0.transform(X)))

# copy over parameters necessary for transform
est.mean_ = est0.mean_
```
Not sure if this is supposed to work, but it crashes when replacing these attributes with arrays of another type. For example:

```python
import dpnp

est.mean_ = dpnp.array(est0.mean_, device="gpu", dtype=np.float32)
est.components_ = dpnp.array(est0.components_, device="gpu", dtype=np.float32)
est.explained_variance_ = dpnp.array(est0.explained_variance_, device="gpu", dtype=np.float32)
```

The failing test here has been popping up in other PRs too, but I guess this would be the right PR to address it.
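For context, PCA's transform is just a centered projection, so the fitted attributes participate directly in the array arithmetic. A plain-NumPy sketch of the math (the names mirror sklearn's fitted attributes, but this is an illustration, not the sklearnex code) shows why `mean_` and `components_` must live in the same array namespace and dtype as the input:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))

# "Fit": PCA stores the column means and the top right singular vectors
mean_ = X.mean(axis=0)
U, s, Vt = np.linalg.svd(X - mean_, full_matrices=False)
components_ = Vt[:2]  # keep 2 components

# transform(X) is (X - mean_) @ components_.T
Xt = (X - mean_) @ components_.T

# Replacing mean_ or components_ with arrays from another namespace or
# device (e.g. dpnp arrays on GPU) turns this into a mixed-namespace
# operation, which cannot be resolved silently -- hence the crash above.
```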
david-cortes-intel left a comment
LGTM pending these two:
#2106 (comment)
#2106 (comment)
/intelci: run
Issues related to PCA dtype comparisons are being addressed in #2666; the issues related to …
Description
This enables array API support for the PCA and IncrementalPCA sklearnex estimators. This required a refactor of the related sklearnex and onedal estimators in order to match design guidelines and simplify integration. This re-enables 50 sklearn tests.
This required the following changes:
- `like` keyword for proper return types
- `_onedal_model` is set to None and always exists as an attribute, in line with the new design rule
- `predict` renamed to `transform` in the onedal estimator (and its use in testing)
- `fit` and `fit_transform` for the PCA and IncrementalPCA sklearnex estimators refactored to minimize maintenance (no longer duplicating sklearn code nor requiring version checking). This was done by patching `fit` and required `_onedal_fit` to return the validated X data for later use.
- Changes to `_onedal_cpu_supported` and `_onedal_gpu_supported`
- Refactored helpers (`_validate_n_components`, `_postprocess_n_components` and `_compute_noise_variance`). Added comments to clarify why things are done, for simple future analysis
- `get_namespace` used and corresponding `validate_data` dtypes modified throughout the code
- `n_features_` property for sklearn conformance, rather than duplicating the same data twice
- As in `LinearRegression` and `Ridge`, attributes required to generate models for prediction/inference are set so that modifications in the sklearnex estimator (i.e. by a later fit which falls back to sklearn) propagate properly to the onedal estimator. This means the output of `transform` will always reflect the last fit.
- `check_is_fitted` is properly added to IncrementalPCA
- `n_components` check (which matches fixes for an underlying bug in sklearn IncrementalPCA)
- `onedal_svd` approach which is available in oneDAL. This matches the PCA keyword but is not available in the sklearn IncrementalPCA implementation, and will require a deselection in `test_patching` when IncrementalPCA is brought out of preview

No changes in performance are expected.
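For illustration, here is a hypothetical sketch of the kind of resolution a helper like `_validate_n_components` performs, following sklearn's documented PCA semantics (this is not the sklearnex implementation; the function name and rules are simplified assumptions):

```python
def validate_n_components(n_components, n_samples, n_features):
    """Resolve and validate n_components following sklearn PCA semantics.

    Illustrative sketch only, not the sklearnex implementation.
    """
    if n_components is None:
        # sklearn's full solver defaults to min(n_samples, n_features)
        return min(n_samples, n_features)
    if n_components == "mle":
        if n_samples < n_features:
            raise ValueError("n_components='mle' requires n_samples >= n_features")
        return n_components
    if isinstance(n_components, float) and 0 < n_components < 1:
        # interpreted as a fraction of explained variance to retain
        return n_components
    if isinstance(n_components, int) and 1 <= n_components <= min(n_samples, n_features):
        return n_components
    raise ValueError(f"invalid n_components={n_components!r}")

print(validate_n_components(None, 10, 5))  # -> 5
```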
Checklist to comply with before moving PR from draft:
PR completeness and readability
Testing
Performance