Experimental datumaro implementation #4920
base: develop
Conversation
Commits:
- Signed-off-by: Albert van Houten <albert.van.houten@intel.com> Co-authored-by: Leonardo Lai <leonardo.lai@intel.com>
- …ing_extensions into feature/datumaro Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
- …4751) Signed-off-by: Albert van Houten <albert.van.houten@intel.com> Co-authored-by: Grégoire Payen de La Garanderie <gdlg@hochet.info>
- …ing_extensions into feature/datumaro # Conflicts: # library/src/otx/data/dataset/base_new.py # library/src/otx/data/dataset/classification_new.py # library/src/otx/data/dataset/detection_new.py # library/src/otx/data/dataset/instance_segmentation_new.py # library/src/otx/data/dataset/keypoint_detection_new.py # library/src/otx/data/dataset/segmentation_new.py # library/src/otx/data/entity/sample.py
- Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
- Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
- …ing_extensions into feature/datumaro
- …ing_extensions into feature/datumaro
- Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
- Signed-off-by: Albert van Houten <albert.van.houten@intel.com> Co-authored-by: Albert van Houten <albert.van.houten@intel.com>
- Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
- …orm/training_extensions into feature/datumaro
- Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
- Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
- Signed-off-by: Albert van Houten <albert.van.houten@intel.com>
- …ing_extensions into feature/datumaro Signed-off-by: Albert van Houten <albert.van.houten@intel.com> # Conflicts: # library/pyproject.toml # library/src/otx/data/dataset/anomaly.py # library/src/otx/data/dataset/base.py # library/src/otx/data/dataset/classification.py # library/src/otx/data/dataset/detection.py # library/src/otx/data/dataset/instance_segmentation.py # library/src/otx/data/dataset/keypoint_detection.py # library/src/otx/data/dataset/segmentation.py # library/src/otx/data/dataset/tile.py # library/src/otx/data/factory.py # library/src/otx/data/module.py # library/src/otx/data/transform_libs/torchvision.py # library/tests/unit/data/samplers/test_class_incremental_sampler.py # library/tests/unit/data/utils/test_utils.py
Pull Request Overview
This PR implements an experimental Datumaro dataset integration for OTX, transitioning from legacy Datumaro components to the new experimental Dataset API. The changes introduce a new sample-based architecture while maintaining compatibility with existing OTX functionality.
Key changes:
- Migration from legacy Datumaro components to experimental Dataset API with schema-based conversion
- Introduction of new OTXSample-based data entities with PyTree registration for TorchVision compatibility
- Replacement of legacy polygon handling with numpy ragged arrays for better performance
- Comprehensive test updates and new test implementations for the updated dataset architecture
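The switch to numpy ragged arrays for polygons can be illustrated with a small sketch. The array layout below (an object-dtype array holding one `(N, 2)` float array per polygon) and the image width of 100 are assumptions for illustration, not the PR's actual data layout:

```python
import numpy as np

# Sketch: variable-length polygons stored as a numpy "ragged" array
# (dtype=object), replacing per-polygon wrapper objects.
polygons = np.empty(2, dtype=object)
polygons[0] = np.array([[10, 10], [50, 10], [50, 50], [10, 50]], dtype=np.float32)
polygons[1] = np.array([[0, 0], [5, 0], [5, 5]], dtype=np.float32)  # triangle, different length

# Per-polygon vectorized ops still apply, e.g. a horizontal flip in a
# 100-pixel-wide image; copying first avoids mutating the stored polygons.
flipped = np.empty_like(polygons)
for i, p in enumerate(polygons):
    q = p.copy()
    q[:, 0] = 100 - q[:, 0]
    flipped[i] = q
```

Each element keeps its own vertex count, while the x-coordinates of every polygon can be transformed with a single slice assignment.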
Reviewed Changes
Copilot reviewed 59 out of 59 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| pyproject.toml | Updates Datumaro dependency to experimental branch version |
| library/src/otx/data/dataset/*.py | Implements new dataset classes using experimental Datumaro with sample-based architecture |
| library/src/otx/data/entity/sample.py | Adds new OTXSample classes with PyTree registration for transform compatibility |
| library/tests/unit/types/test_label.py | Updates label tests to use new hierarchical label categories |
| library/tests/unit/data/transform_libs/test_torchvision.py | Converts polygon handling from Datumaro objects to numpy arrays |
```python
fake_polygons = np.array(
    [
        np.array([[10, 10], [50, 10], [50, 50], [10, 50]]),  # Rectangle polygon for first object
        np.array([[60, 60], [100, 60], [100, 100], [60, 100]]),  # Rectangle polygon for second object
    ]
)
```
Copilot AI Oct 29, 2025
The polygon data structure has changed from Datumaro Polygon objects to numpy ragged arrays, but the test doesn't validate that the new format maintains the same geometric properties. Consider adding assertions to verify that the polygon areas and vertex coordinates are equivalent between the old and new formats.
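Such an equivalence check could compare areas computed from both representations. The sketch below uses the shoelace formula; the flat `[x0, y0, x1, y1, …]` legacy layout and the helper name are assumptions, not the test's actual code:

```python
import numpy as np

def shoelace_area(points: np.ndarray) -> float:
    # Shoelace formula: area of a simple polygon from its (N, 2) vertices.
    x, y = points[:, 0], points[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))

old_points = [10, 10, 50, 10, 50, 50, 10, 50]  # flat interleaved list (legacy-style)
new_points = np.array([[10, 10], [50, 10], [50, 50], [10, 50]], dtype=np.float32)

# The two representations should describe the same geometry.
assert shoelace_area(np.asarray(old_points).reshape(-1, 2)) == shoelace_area(new_points)
```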
```python
p = np.asarray(polygon.points)
p[0::2] = width - p[0::2]
return p.tolist()

def revert_hflip(polygon: np.ndarray, width: int) -> np.ndarray:
```
Copilot AI Oct 29, 2025
The function modifies the input polygon array in-place, which could cause issues if the original array is used elsewhere. The function should create a copy before modification to avoid unintended side effects.
Suggested change:

```python
def revert_hflip(polygon: np.ndarray, width: int) -> np.ndarray:
    polygon = polygon.copy()
```
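To see why the copy matters, here is a minimal sketch of the aliasing hazard; the `(N, 2)` vertex layout and the function name `revert_hflip_copy` are assumptions for illustration:

```python
import numpy as np

def revert_hflip_copy(polygon: np.ndarray, width: int) -> np.ndarray:
    # Copy first, so the caller's array is not mutated as a side effect.
    out = polygon.copy()
    out[:, 0] = width - out[:, 0]
    return out

orig = np.array([[10, 10], [50, 10]], dtype=np.float32)
flipped = revert_hflip_copy(orig, width=100)
```

Without the copy, `orig` would be flipped in place and silently corrupt any other code still holding a reference to it.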
```python
            rescaled_polygons[i] = scaled_points
        else:
            # Handle empty or invalid polygons
            rescaled_polygons[i] = np.array([[0, 0], [0, 0], [0, 0]], dtype=np.float32)
```
Copilot AI Oct 29, 2025
[nitpick] Creating a dummy polygon with three identical points may not be the best approach for handling empty/invalid polygons. Consider using an empty array or a clearly marked invalid polygon structure that can be properly filtered out later in the pipeline.
Suggested change:

```python
rescaled_polygons[i] = np.empty((0, 2), dtype=np.float32)
```
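The advantage of an empty `(0, 2)` array over dummy points is that invalid entries can be filtered downstream by a size check. A minimal sketch (the data here is illustrative, not from the PR):

```python
import numpy as np

# An object array mixing valid polygons and an empty (0, 2) invalid marker.
polygons = np.empty(3, dtype=object)
polygons[0] = np.array([[0, 0], [4, 0], [4, 4]], dtype=np.float32)
polygons[1] = np.empty((0, 2), dtype=np.float32)  # invalid/empty marker
polygons[2] = np.array([[1, 1], [2, 1], [2, 2]], dtype=np.float32)

# Downstream code can drop invalid polygons with a trivial predicate,
# whereas a dummy triangle of identical points would pass this check.
valid = [p for p in polygons if p.size > 0]
```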
```python
# Handle empty or invalid polygons
translated_polygons[i] = np.array([[0, 0], [0, 0], [0, 0]], dtype=np.float32)
```
Copilot AI Oct 29, 2025
[nitpick] Same issue as in rescale_polygons - using dummy points for invalid polygons. This pattern is repeated in multiple functions and should be standardized across the codebase for consistency.
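One way to standardize the fallback across functions is a single shared helper; the name `sanitize_polygon` and its rules (fewer than three vertices counts as invalid) are hypothetical, not part of the codebase:

```python
import numpy as np

def sanitize_polygon(points: np.ndarray) -> np.ndarray:
    # Hypothetical shared helper: one place decides how invalid polygons are
    # represented, so rescale/translate/flip don't each hard-code a fallback.
    if points.size == 0 or points.shape[0] < 3:
        return np.empty((0, 2), dtype=np.float32)
    return points.astype(np.float32, copy=False)
```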
# Conflicts: # library/tests/unit/data/dataset/test_base.py # library/tests/unit/data/test_tiling.py
This pull request introduces the new experimental Datumaro dataset into the dataset classes of OTX. Much of the existing logic, such as image color-channel management and the tiling implementation, has been migrated into Datumaro.
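For context on what a tiling implementation involves, here is a generic sketch of overlapping grid tiling; this is an illustration of the concept only, not Datumaro's or OTX's actual API:

```python
import numpy as np

def tile_image(img: np.ndarray, tile: int, overlap: int):
    # Yield (y, x, crop) for each tile position on an overlapping grid.
    # Edge tiles may be smaller than `tile` when the image size is not
    # a multiple of the stride.
    step = tile - overlap
    h, w = img.shape[:2]
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            yield y, x, img[y:y + tile, x:x + tile]

tiles = list(tile_image(np.zeros((100, 100, 3), dtype=np.uint8), tile=64, overlap=16))
```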
Summary
How to test
Checklist