
Conversation

@samir-nasibli (Contributor) commented Oct 13, 2024

Description

This enables array API support for the PCA and IncrementalPCA sklearnex estimators. This required a refactor of the related sklearnex and onedal estimators in order to match design guidelines and simplify integration. This re-enables 50 sklearn tests.

This required the following changes:

  • Fixed missing IncrementalPCA sklearn conformance tests (the estimator had been untested before)
  • Cast data to the appropriate device when the array supports the array API standard
  • Changed inheritance of the IncrementalPCA onedal estimator to simplify the code
  • Removed check_array use from onedal's IncrementalPCA, since that validation is a sklearn conformance requirement
  • Removed use of check_raw_input as it is no longer relevant
  • Removed sklearn conformance handling of n_components from the oneDAL estimator
  • Removed the double setting of self._queue
  • Integrated use of the like keyword for proper return types
  • Set _onedal_model to None so that it always exists as an attribute, in line with the new design rule
  • Moved many sklearn conformance functions from the onedal estimator to sklearnex
  • Removed the aliasing of predict to transform in the onedal estimator (and its use in testing)
  • Changed the structure of fit and fit_transform for the PCA and IncrementalPCA sklearnex estimators to minimize maintenance (no longer duplicating sklearn code nor requiring version checking). This was done by patching fit and requiring _onedal_fit to return the validated X data for later use.
  • Simplified the calculations made in _onedal_cpu_supported and _onedal_gpu_supported
  • Fixed problems in the IncrementalPCA patching, which had a bug with respect to sparse inputs
  • Added sklearn conformance functions (_validate_n_components, _postprocess_n_components, and _compute_noise_variance), with comments clarifying why things are done to ease future analysis
  • Added get_namespace and modified the corresponding validate_data dtypes throughout the code (see the sketch after this list)
  • Set the onedal estimator factory function as a PCA class method in order to easily create an SPMD interface
  • Added the necessary PCA SPMD sklearnex interface
  • Moved modification of the output results from onedal to sklearnex for PCA sklearn conformance. This included adding copying behavior, generalized to the array API.
  • Created an n_features_ property for sklearn conformance, rather than storing the same data twice
  • As with linear model algorithms such as LinearRegression and Ridge, attributes required to generate models for prediction/inference are set so that modifications in the sklearnex estimator (e.g. by a later fit which falls back to sklearn) are propagated properly to the onedal estimator. This means the output of transform will always reflect the last fit.
  • Properly added check_is_fitted to IncrementalPCA
  • Added missing sklearn conformance to the n_components check (matching fixes for an underlying bug in sklearn's IncrementalPCA)
  • Added a new parameter (svd_solver) to the sklearnex IncrementalPCA interface, exposing the onedal_svd approach available in oneDAL. This matches the PCA keyword but is not available in the sklearn IncrementalPCA implementation, and will require a deselection in test_patching when IncrementalPCA is brought out of preview.
  • Moved the finalize_fit sklearn conformance changes from onedal to sklearnex in IncrementalPCA
  • Re-enabled deselected sklearnex tests with respect to design and end-to-end array API tests (test_patching)
  • Added testing for cases where attributes are modified or a fallback sklearn fit occurs after an onedal fit.
  • Removed unnecessary shape restriction on oneDAL support.
  • Deselected tests related to limitations in dlpack support of the array_api_strict numpy backend
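
As a minimal illustration of the get_namespace pattern referenced in the list above (get_namespace is sklearn's private array API helper; the function below is a hypothetical sketch, not the actual sklearnex code):

from sklearn.utils._array_api import get_namespace

def center_and_project(X, mean, components):
    # Resolve the input's array namespace (numpy, array_api_strict, dpnp, ...)
    xp, _ = get_namespace(X)
    # Keep the math in that namespace so the result matches the caller's array type
    Xc = X - xp.asarray(mean)
    return Xc @ xp.asarray(components).T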

No changes in performance are expected.


Checklist to comply with before moving PR from draft:

PR completeness and readability

  • I have reviewed my changes thoroughly before submitting this pull request.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes, or created a separate PR with the update and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have added the respective label(s) to the PR if I have permission to do so.
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended testing suite if new functionality was introduced in this PR.

Performance

  • I have measured performance for affected algorithms using scikit-learn_bench and provided at least summary table with measured data, if performance change is expected.
  • I have provided justification why performance has changed or why changes are not expected.
  • I have provided justification why quality metrics have changed or why changes are not expected.
  • I have extended benchmarking suite and provided corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.
# set attributes necessary for calls to transform, will modify
# self._onedal_estimator, and clear any previous fit models
# Follow guidance from sklearn PCA._fit_full and copy the data
self.n_components_ = n_components

@icfaust What happens if the number of non-zero singular values is less than all of (a) the user-supplied n_components, (b) the number of rows in the data, and (c) the number of columns in the data?

Could you please add a test with lots of duplicated or linearly dependent columns, up to the point where such a situation is encountered?


Dropping an example here in case it isn't very clear what kind of inputs would trigger this:

import numpy as np

rng = np.random.default_rng(seed=123)
m = 4
n = 6
mu = rng.standard_normal(size=n)
S = rng.standard_normal(size=(n+1, n))
S = S.T @ S
X = rng.multivariate_normal(mu, S, size=m)
X[:, 3:] = X[:, :3]

This data has 4 rows, 6 columns, and 3 non-zero singular values:
[screenshot: the computed singular values of X, three of which are non-zero]

In this case, sklearn doesn't drop the last singular value even though it should, but at least it doesn't end up crashing or anything like that.
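
A quick, hypothetical way to verify this (not part of the PR): since columns 3:6 of X copy columns 0:3, the matrix has rank 3, so an SVD reports only three non-zero singular values.

import numpy as np

# X is the array built in the snippet above; of its min(4, 6) = 4 singular
# values, only 3 are numerically non-zero due to the duplicated columns.
print(np.linalg.svd(X, compute_uv=False))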

@icfaust (Contributor) commented Aug 11, 2025

/intelci: run

@icfaust (Contributor) commented Aug 11, 2025

/intelci: run

@icfaust (Contributor) commented Aug 11, 2025

/intelci: run

@icfaust (Contributor) commented Aug 13, 2025

/intelci: run

assert not np.array_equal(_as_numpy(est.transform(X)), _as_numpy(est0.transform(X)))

# copy over parameters necessary for transform
est.mean_ = est0.mean_

Not sure if this is supposed to work, but it crashes when these attributes are replaced with arrays of another type. For example:

import dpnp
import numpy as np

est.mean_ = dpnp.array(est0.mean_, device="gpu", dtype=np.float32)
est.components_ = dpnp.array(est0.components_, device="gpu", dtype=np.float32)
est.explained_variance_ = dpnp.array(est0.explained_variance_, device="gpu", dtype=np.float32)
@david-cortes-intel (Contributor)

The failing test here has been popping up in other PRs too, but I guess this would be the right PR to address it:

=================================== FAILURES ===================================
______________________ test_pca_dtype_preservation[full] _______________________

svd_solver = 'full'

    @pytest.mark.parametrize("svd_solver", PCA_SOLVERS)
    def test_pca_dtype_preservation(svd_solver):
>       check_pca_float_dtype_preservation(svd_solver)

decomposition/tests/test_pca.py:649:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
decomposition/tests/test_pca.py:669: in check_pca_float_dtype_preservation
    assert_allclose(pca_64.components_, pca_32.components_, rtol=2e-4)
utils/_testing.py:292: in assert_allclose
    np_assert_allclose(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (<function assert_allclose.<locals>.compare at 0x7f62bf73beb0>, array([[-0.62022375, -0.15983497,  0.38316965,  0.6655...4071014,  0.9080595 , -0.21966814],
       [ 0.12497696,  0.8811203 , -0.16712739,  0.42435375]], dtype=float32))
kwds = {'equal_nan': True, 'err_msg': '', 'header': 'Not equal to tolerance rtol=0.0002, atol=0', 'verbose': True}

    @wraps(func)
    def inner(*args, **kwds):
        with self._recreate_cm():
>           return func(*args, **kwds)
E           AssertionError:
E           Not equal to tolerance rtol=0.0002, atol=0
E
E           Mismatched elements: 4 / 12 (33.3%)
E           Max absolute difference: 0.00014655
E           Max relative difference: 0.00087689
E            x: array([[-0.620224, -0.159835,  0.38317 ,  0.66555 ],
E                  [ 0.263179,  0.240851,  0.908007, -0.21966 ],
E                  [ 0.124977,  0.881087, -0.167274,  0.424366]])
E            y: array([[-0.620258, -0.159886,  0.38311 ,  0.66554 ],
E                  [ 0.263121,  0.24071 ,  0.908059, -0.219668],
E                  [ 0.124977,  0.88112 , -0.167127,  0.424354]], dtype=float32)

../../contextlib.py:79: AssertionError
@david-cortes-intel (Contributor) left a comment

LGTM pending these two:
#2106 (comment)
#2106 (comment)

@icfaust (Contributor) commented Aug 17, 2025

/intelci: run

@icfaust (Contributor) commented Aug 17, 2025

Work on the PCA dtype comparison issues is underway in #2666. The issues related to create_model are more extensive: there is no prevention of trying to create a model from data on a device that is neither SYCL nor CPU, which would lead to odd failures in to_table. That should be refactored for the linear_regression, Ridge, and PCA algorithms simultaneously, with centralized fixtures and testing.
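
For illustration, a minimal sketch of the kind of guard described above, assuming dpctl-style device attributes (sycl_device, is_cpu, is_gpu); the helper name and wiring are hypothetical, not from the codebase:

def _check_device_supported(array):
    # dpnp/dpctl arrays expose the SYCL device they live on; host numpy arrays do not
    device = getattr(array, "sycl_device", None)
    if device is None:
        return  # plain host (CPU) data can always be converted by to_table
    if not (device.is_cpu or device.is_gpu):
        raise ValueError("model creation requires data on a SYCL CPU or GPU device")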

@icfaust merged commit 35ccc0c into uxlfoundation:main Aug 17, 2025
28 of 32 checks passed

Labels: Array API, enhancement (New feature or request)