
Conversation

@samir-nasibli (Contributor) commented Oct 13, 2024

Description

This enables array API support for the PCA and IncrementalPCA sklearnex estimators. This required a refactor of the related sklearnex and onedal estimators in order to match design guidelines and simplify integration. This re-enables 50 sklearn tests.

This required the following changes:

  • Fixed missing IncrementalPCA sklearn conformance tests (the estimator had been untested before)
  • Cast data to the appropriate device when the array supports the array API standard
  • Changed inheritance of the IncrementalPCA onedal estimator to simplify the code
  • Removed check_array use from onedal's IncrementalPCA, since that validation is a sklearn conformance requirement
  • Removed use of check_raw_input as it is no longer relevant
  • Removed sklearn conformance handling of n_components from the oneDAL estimator
  • Removed the double setting of self._queue
  • Integrated use of the like keyword for proper return types
  • Set _onedal_model to None so that it always exists as an attribute, in line with the new design rule
  • Moved many sklearn conformance functions from the onedal estimator to sklearnex
  • Removed the aliasing of predict to transform in the onedal estimator (and its use in testing)
  • Changed the structure of fit and fit_transform for the PCA and IncrementalPCA sklearnex estimators to minimize maintenance (no longer duplicating sklearn code nor requiring version checking). This was done by patching fit and requiring _onedal_fit to return the validated X data for later use.
  • Simplified the calculations made in _onedal_cpu_supported and _onedal_gpu_supported
  • Fixed problems in the IncrementalPCA patching, which had a bug with respect to sparse inputs
  • Added sklearn conformance functions (_validate_n_components, _postprocess_n_components, and _compute_noise_variance), with comments clarifying why things are done to ease future analysis
  • Added get_namespace and modified the corresponding validate_data dtypes throughout the code (see the sketch after this list)
  • Set the onedal estimator factory function as a PCA class method in order to easily create an SPMD interface
  • Added the necessary PCA SPMD sklearnex interface
  • Moved modification of the output results from onedal to sklearnex for PCA sklearn conformance. This included adding copying behavior, generalized to the array API.
  • Created an n_features_ property for sklearn conformance, rather than storing the same data twice
  • As with linear model algorithms such as LinearRegression and Ridge, attributes required to generate models for prediction/inference are set so that modifications in the sklearnex estimator (e.g. by a later fit which falls back to sklearn) are propagated properly to the onedal estimator. This means the output of transform will always reflect the last fit.
  • Properly added check_is_fitted to IncrementalPCA
  • Added missing sklearn conformance to the n_components check (matching fixes for an underlying bug in sklearn's IncrementalPCA)
  • Added a new parameter (svd_solver) to the sklearnex IncrementalPCA interface, exposing the onedal_svd approach available in oneDAL. This matches the PCA keyword but is not available in the sklearn IncrementalPCA implementation, and will require a deselection in test_patching when IncrementalPCA is brought out of preview.
  • Moved the finalize_fit sklearn conformance changes from onedal to sklearnex in IncrementalPCA
  • Re-enabled deselected sklearnex tests with respect to design and end-to-end array API tests (test_patching)
  • Added testing for cases where attributes are modified or a fallback sklearn fit occurs after an onedal fit.
  • Removed unnecessary shape restriction on oneDAL support.
  • Deselected tests related to limitations in dlpack support of the array_api_strict numpy backend
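
As a minimal illustration of the get_namespace pattern referenced in the list above (get_namespace is sklearn's private array API helper; the function below is a hypothetical sketch, not the actual sklearnex code):

from sklearn.utils._array_api import get_namespace

def center_and_project(X, mean, components):
    # Resolve the input's array namespace (numpy, array_api_strict, dpnp, ...)
    xp, _ = get_namespace(X)
    # Keep the math in that namespace so the result matches the caller's array type
    Xc = X - xp.asarray(mean)
    return Xc @ xp.asarray(components).T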

No changes in performance are expected.


Checklist to comply with before moving PR from draft:

PR completeness and readability

  • I have reviewed my changes thoroughly before submitting this pull request.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes, or created a separate PR with the update and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have added the respective label(s) to the PR if I have permission to do so.
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended testing suite if new functionality was introduced in this PR.

Performance

  • I have measured performance for affected algorithms using scikit-learn_bench and provided at least summary table with measured data, if performance change is expected.
  • I have provided justification why performance has changed or why changes are not expected.
  • I have provided justification why quality metrics have changed or why changes are not expected.
  • I have extended benchmarking suite and provided corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.
# set attributes necessary for calls to transform, will modify
# self._onedal_estimator, and clear any previous fit models
# Follow guidance from sklearn PCA._fit_full and copy the data
self.n_components_ = n_components

@icfaust What happens if the number of non-zero singular values is less than all of (a) the user-supplied n_components, (b) the number of rows in the data, and (c) the number of columns in the data?

Could you please add a test with lots of duplicated or linearly dependent columns, up to the point where such a situation is encountered?


Dropping an example here in case it isn't very clear what kind of inputs would trigger this:

import numpy as np

rng = np.random.default_rng(seed=123)
m = 4
n = 6
mu = rng.standard_normal(size=n)
S = rng.standard_normal(size=(n+1, n))
S = S.T @ S
X = rng.multivariate_normal(mu, S, size=m)
X[:, 3:] = X[:, :3]

This data has 4 rows, 6 columns, and 3 non-zero singular values:
[screenshot: the computed singular values of X, three of which are non-zero]

In this case, sklearn doesn't drop the last singular value even though it should, but at least it doesn't end up crashing or anything like that.
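
A quick, hypothetical way to verify this (not part of the PR): since columns 3:6 of X copy columns 0:3, the matrix has rank 3, so an SVD reports only three non-zero singular values.

import numpy as np

# X is the array built in the snippet above; of its min(4, 6) = 4 singular
# values, only 3 are numerically non-zero due to the duplicated columns.
print(np.linalg.svd(X, compute_uv=False))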

@icfaust (Contributor) commented Aug 11, 2025

/intelci: run

@icfaust (Contributor) commented Aug 11, 2025

/intelci: run

@icfaust (Contributor) commented Aug 11, 2025

/intelci: run

@icfaust (Contributor) commented Aug 13, 2025

/intelci: run

assert not np.array_equal(_as_numpy(est.transform(X)), _as_numpy(est0.transform(X)))

# copy over parameters necessary for transform
est.mean_ = est0.mean_

Not sure if this is supposed to work, but it crashes when these attributes are replaced with arrays of another type. For example:

import dpnp
import numpy as np

est.mean_ = dpnp.array(est0.mean_, device="gpu", dtype=np.float32)
est.components_ = dpnp.array(est0.components_, device="gpu", dtype=np.float32)
est.explained_variance_ = dpnp.array(est0.explained_variance_, device="gpu", dtype=np.float32)
@david-cortes-intel (Contributor)

The failing test here has been popping up in other PRs too, but I guess this would be the right PR to address it:

=================================== FAILURES ===================================
______________________ test_pca_dtype_preservation[full] _______________________

svd_solver = 'full'

    @pytest.mark.parametrize("svd_solver", PCA_SOLVERS)
    def test_pca_dtype_preservation(svd_solver):
>       check_pca_float_dtype_preservation(svd_solver)

decomposition/tests/test_pca.py:649:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
decomposition/tests/test_pca.py:669: in check_pca_float_dtype_preservation
    assert_allclose(pca_64.components_, pca_32.components_, rtol=2e-4)
utils/_testing.py:292: in assert_allclose
    np_assert_allclose(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (<function assert_allclose.<locals>.compare at 0x7f62bf73beb0>, array([[-0.62022375, -0.15983497,  0.38316965,  0.6655...4071014,  0.9080595 , -0.21966814],
       [ 0.12497696,  0.8811203 , -0.16712739,  0.42435375]], dtype=float32))
kwds = {'equal_nan': True, 'err_msg': '', 'header': 'Not equal to tolerance rtol=0.0002, atol=0', 'verbose': True}

    @wraps(func)
    def inner(*args, **kwds):
        with self._recreate_cm():
>           return func(*args, **kwds)
E           AssertionError:
E           Not equal to tolerance rtol=0.0002, atol=0
E
E           Mismatched elements: 4 / 12 (33.3%)
E           Max absolute difference: 0.00014655
E           Max relative difference: 0.00087689
E            x: array([[-0.620224, -0.159835,  0.38317 ,  0.66555 ],
E                  [ 0.263179,  0.240851,  0.908007, -0.21966 ],
E                  [ 0.124977,  0.881087, -0.167274,  0.424366]])
E            y: array([[-0.620258, -0.159886,  0.38311 ,  0.66554 ],
E                  [ 0.263121,  0.24071 ,  0.908059, -0.219668],
E                  [ 0.124977,  0.88112 , -0.167127,  0.424354]], dtype=float32)

../../contextlib.py:79: AssertionError
@david-cortes-intel (Contributor) left a comment

LGTM pending these two:
#2106 (comment)
#2106 (comment)

@icfaust (Contributor) commented Aug 17, 2025

/intelci: run

@icfaust (Contributor) commented Aug 17, 2025

Work on the PCA dtype comparison issues is underway in #2666. The issues related to create_model are more extensive: there is no prevention of trying to create a model from data on a device that is neither SYCL nor CPU, which would lead to odd failures in to_table. That should be refactored for the linear_regression, Ridge, and PCA algorithms simultaneously, with centralized fixtures and testing.
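
For illustration, a minimal sketch of the kind of guard described above, assuming dpctl-style device attributes (sycl_device, is_cpu, is_gpu); the helper name and wiring are hypothetical, not from the codebase:

def _check_device_supported(array):
    # dpnp/dpctl arrays expose the SYCL device they live on; host numpy arrays do not
    device = getattr(array, "sycl_device", None)
    if device is None:
        return  # plain host (CPU) data can always be converted by to_table
    if not (device.is_cpu or device.is_gpu):
        raise ValueError("model creation requires data on a SYCL CPU or GPU device")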

@icfaust merged commit 35ccc0c into uxlfoundation:main Aug 17, 2025
28 of 32 checks passed

Labels: Array API, enhancement (New feature or request)