
Conversation


@AlbertvanHouten AlbertvanHouten commented Oct 20, 2025

This pull request introduces the new experimental dataset into the OTX dataset classes. Much of the existing logic, such as image color-channel management and the tiling implementation, has been migrated to Datumaro.

Summary

How to test

Checklist

  • The PR title and description are clear and descriptive
  • I have manually tested the changes
  • All changes are covered by automated tests
  • All related issues are linked to this PR (if applicable)
  • Documentation has been updated (if applicable)
AlbertvanHouten and others added 16 commits September 4, 2025 10:10
Signed-off-by: Albert van Houten <albert.van.houten@intel.com> Co-authored-by: Leonardo Lai <leonardo.lai@intel.com>
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
…4751) Signed-off-by: Albert van Houten <albert.van.houten@intel.com> Co-authored-by: Grégoire Payen de La Garanderie <gdlg@hochet.info>
…ing_extensions into feature/datumaro # Conflicts: #	library/src/otx/data/dataset/base_new.py #	library/src/otx/data/dataset/classification_new.py #	library/src/otx/data/dataset/detection_new.py #	library/src/otx/data/dataset/instance_segmentation_new.py #	library/src/otx/data/dataset/keypoint_detection_new.py #	library/src/otx/data/dataset/segmentation_new.py #	library/src/otx/data/entity/sample.py
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
Signed-off-by: Albert van Houten <albert.van.houten@intel.com> Co-authored-by: Albert van Houten <albert.van.houten@intel.com>
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
@github-actions github-actions bot added DEPENDENCY Any changes in any dependencies (new dep or its version) should be produced via Change Request on PM TEST Any changes in tests BUILD labels Oct 20, 2025
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
…ing_extensions into feature/datumaro Signed-off-by: Albert van Houten <albert.van.houten@intel.com> # Conflicts: #	library/pyproject.toml #	library/src/otx/data/dataset/anomaly.py #	library/src/otx/data/dataset/base.py #	library/src/otx/data/dataset/classification.py #	library/src/otx/data/dataset/detection.py #	library/src/otx/data/dataset/instance_segmentation.py #	library/src/otx/data/dataset/keypoint_detection.py #	library/src/otx/data/dataset/segmentation.py #	library/src/otx/data/dataset/tile.py #	library/src/otx/data/factory.py #	library/src/otx/data/module.py #	library/src/otx/data/transform_libs/torchvision.py #	library/tests/unit/data/samplers/test_class_incremental_sampler.py #	library/tests/unit/data/utils/test_utils.py
@AlbertvanHouten AlbertvanHouten changed the title Feature/datumaro Experimental datumaro implementation Oct 29, 2025
@AlbertvanHouten AlbertvanHouten marked this pull request as ready for review October 29, 2025 09:24
@AlbertvanHouten AlbertvanHouten requested a review from a team as a code owner October 29, 2025 09:24
Copilot AI review requested due to automatic review settings October 29, 2025 09:24

Copilot AI left a comment


Pull Request Overview

This PR implements an experimental Datumaro dataset integration for OTX, transitioning from legacy Datumaro components to the new experimental Dataset API. The changes introduce a new sample-based architecture while maintaining compatibility with existing OTX functionality.

Key changes:

  • Migration from legacy Datumaro components to experimental Dataset API with schema-based conversion
  • Introduction of new OTXSample-based data entities with PyTree registration for TorchVision compatibility
  • Replacement of legacy polygon handling with numpy ragged arrays for better performance
  • Comprehensive test updates and new test implementations for the updated dataset architecture
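
As a hedged illustration of the ragged-array format the third bullet describes (the variable names here are illustrative, not from the PR): polygons with differing vertex counts cannot share one rectangular array, so each can be stored as its own (N, 2) array inside an object-dtype container.

```python
import numpy as np

# Illustrative sketch of a ragged polygon container: vertex counts differ,
# so each polygon keeps its own (N, 2) array inside an object-dtype array.
triangle = np.array([[0, 0], [4, 0], [0, 3]], dtype=np.float32)
square = np.array([[0, 0], [2, 0], [2, 2], [0, 2]], dtype=np.float32)

polygons = np.empty(2, dtype=object)  # object dtype preserves each shape
polygons[0] = triangle
polygons[1] = square

shapes = [p.shape for p in polygons]
```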

Reviewed Changes

Copilot reviewed 59 out of 59 changed files in this pull request and generated 4 comments.

Show a summary per file:

  • pyproject.toml: Updates Datumaro dependency to experimental branch version
  • library/src/otx/data/dataset/*.py: Implements new dataset classes using experimental Datumaro with sample-based architecture
  • library/src/otx/data/entity/sample.py: Adds new OTXSample classes with PyTree registration for transform compatibility
  • library/tests/unit/types/test_label.py: Updates label tests to use new hierarchical label categories
  • library/tests/unit/data/transform_libs/test_torchvision.py: Converts polygon handling from Datumaro objects to numpy arrays


Comment on lines +143 to +148
fake_polygons = np.array(
    [
        np.array([[10, 10], [50, 10], [50, 50], [10, 50]]),  # Rectangle polygon for first object
        np.array([[60, 60], [100, 60], [100, 100], [60, 100]]),  # Rectangle polygon for second object
    ]
)

Copilot AI Oct 29, 2025


The polygon data structure has changed from Datumaro Polygon objects to numpy ragged arrays, but the test doesn't validate that the new format maintains the same geometric properties. Consider adding assertions to verify that the polygon areas and vertex coordinates are equivalent between the old and new formats.
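
One way such an assertion could look (a sketch only: the flat [x0, y0, x1, y1, ...] layout is the legacy Datumaro points convention, and shoelace_area is a hypothetical helper, not part of the PR):

```python
import numpy as np

# Sketch: legacy Datumaro Polygons store flat interleaved points
# [x0, y0, x1, y1, ...]; the new format stores an (N, 2) array.
# Comparing areas after conversion would catch a geometry bug.
legacy_points = [10, 10, 50, 10, 50, 50, 10, 50]  # 40 x 40 square, legacy layout
ragged = np.asarray(legacy_points, dtype=np.float32).reshape(-1, 2)

def shoelace_area(pts: np.ndarray) -> float:
    # Shoelace formula for a simple polygon given (N, 2) vertices
    x, y = pts[:, 0], pts[:, 1]
    return float(0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(np.roll(x, 1), y)))

area = shoelace_area(ragged)
```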

    p = np.asarray(polygon.points)
    p[0::2] = width - p[0::2]
    return p.tolist()
def revert_hflip(polygon: np.ndarray, width: int) -> np.ndarray:

Copilot AI Oct 29, 2025


The function modifies the input polygon array in-place, which could cause issues if the original array is used elsewhere. The function should create a copy before modification to avoid unintended side effects.

Suggested change
def revert_hflip(polygon: np.ndarray, width: int) -> np.ndarray:
def revert_hflip(polygon: np.ndarray, width: int) -> np.ndarray:
    polygon = polygon.copy()
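
A minimal sketch of why the copy matters, assuming (N, 2) polygon arrays for clarity (the reviewed function works on flat interleaved points; the _inplace/_safe names are illustrative, not from the PR):

```python
import numpy as np

# Sketch of the in-place hazard: without .copy(), reverting a horizontal
# flip writes through to the caller's array.
def revert_hflip_inplace(polygon: np.ndarray, width: int) -> np.ndarray:
    polygon[:, 0] = width - polygon[:, 0]  # mutates the argument
    return polygon

def revert_hflip_safe(polygon: np.ndarray, width: int) -> np.ndarray:
    polygon = polygon.copy()  # work on a private copy
    polygon[:, 0] = width - polygon[:, 0]
    return polygon

original = np.array([[10.0, 10.0], [50.0, 10.0], [50.0, 50.0]])

safe_out = revert_hflip_safe(original, 100)
caller_intact = np.array_equal(original, [[10, 10], [50, 10], [50, 50]])

inplace_out = revert_hflip_inplace(original, 100)
caller_clobbered = np.array_equal(original, [[90, 10], [50, 10], [50, 50]])
```

Both calls return the same flipped coordinates; only the safe variant leaves the caller's array untouched.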
        rescaled_polygons[i] = scaled_points
    else:
        # Handle empty or invalid polygons
        rescaled_polygons[i] = np.array([[0, 0], [0, 0], [0, 0]], dtype=np.float32)

Copilot AI Oct 29, 2025


[nitpick] Creating a dummy polygon with three identical points may not be the best approach for handling empty/invalid polygons. Consider using an empty array or a clearly marked invalid polygon structure that can be properly filtered out later in the pipeline.

Suggested change
        rescaled_polygons[i] = np.array([[0, 0], [0, 0], [0, 0]], dtype=np.float32)
        rescaled_polygons[i] = np.empty((0, 2), dtype=np.float32)
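
A sketch of the downstream filtering this sentinel would enable, assuming an object-dtype polygon container (the names here are illustrative, not from the PR):

```python
import numpy as np

# Sketch of the filtering a (0, 2) sentinel enables: an empty array is
# unambiguous and drops out with a size check, while a dummy three-point
# polygon would survive into later pipeline stages.
polygons = np.empty(3, dtype=object)
polygons[0] = np.array([[0, 0], [10, 0], [10, 10]], dtype=np.float32)
polygons[1] = np.empty((0, 2), dtype=np.float32)  # invalid-polygon marker
polygons[2] = np.array([[5, 5], [8, 5], [8, 8], [5, 8]], dtype=np.float32)

valid = [p for p in polygons if p.shape[0] >= 3]  # a polygon needs 3+ vertices
```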
Comment on lines +352 to +353
        # Handle empty or invalid polygons
        translated_polygons[i] = np.array([[0, 0], [0, 0], [0, 0]], dtype=np.float32)

Copilot AI Oct 29, 2025


[nitpick] Same issue as in rescale_polygons - using dummy points for invalid polygons. This pattern is repeated in multiple functions and should be standardized across the codebase for consistency.

# Conflicts: #	library/tests/unit/data/dataset/test_base.py #	library/tests/unit/data/test_tiling.py

Labels

  • BUILD
  • DEPENDENCY: Any changes in any dependencies (new dep or its version) should be produced via Change Request on PM
  • TEST: Any changes in tests

4 participants