PPBR_AZ Benchmark and dataset mismatch

Describe the bug

The PPBR_AZ dataset returns a different number of molecules depending on whether it is loaded via the admet_group benchmark or the ADME module. Additionally, the data loaded through admet_group does not match the documentation.

PPBR_AZ from ADME has 1614 molecules (train - 1130, valid - 161, test - 323).
According to the admet_group documentation, PPBR_AZ should contain 1797 molecules in total.
When loaded via admet_group, the dataset contains 2790 molecules (2231 train_val, 559 test).

To Reproduce

from tdc.benchmark_group import admet_group
group = admet_group(path="data/")
benchmark = group.get('PPBR_AZ')
benchmark['train_val']  # <-- 2231 rows dataframe
benchmark['test']  # <-- 559 rows, including duplicated molecules with different binding rates

from tdc.single_pred import ADME
data = ADME(name='PPBR_AZ')
train_split = data.get_split()['train']  # <-- 1130 molecules
valid_split = data.get_split()['valid']  # <-- 161 molecules
test_split = data.get_split()['test']  # <-- 323 molecules
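
To narrow down where the extra rows come from, it may help to compare the two sources by molecule identity rather than by row count alone. The following is a minimal sketch, not part of TDC: `compare_molecule_sets` is a hypothetical helper, it works on plain sequences of molecule strings, and the toy SMILES below are made up for illustration. With the real data, the inputs would presumably be the molecule column of each dataframe (assumed here to be named `Drug`).

```python
def compare_molecule_sets(benchmark_molecules, loader_molecules):
    """Summarize row counts, unique counts, and overlap of two molecule collections.

    Hypothetical helper for illustration; molecules are compared as raw strings,
    so two different SMILES spellings of the same compound count as different.
    """
    benchmark_list = list(benchmark_molecules)
    loader_list = list(loader_molecules)
    benchmark_set = set(benchmark_list)
    loader_set = set(loader_list)
    return {
        "benchmark_rows": len(benchmark_list),
        "benchmark_unique": len(benchmark_set),
        "loader_rows": len(loader_list),
        "loader_unique": len(loader_set),
        "shared": len(benchmark_set & loader_set),
        "only_in_benchmark": len(benchmark_set - loader_set),
        "only_in_loader": len(loader_set - benchmark_set),
    }


# Toy data (made up), standing in for the real molecule columns.
toy_benchmark = ["CCO", "CCO", "CCN", "c1ccccc1"]
toy_loader = ["CCO", "CCN", "CCC"]
summary = compare_molecule_sets(toy_benchmark, toy_loader)
print(summary)
```

If the benchmark has many more rows than unique molecules, the gap is likely explained by repeated measurements rather than by additional compounds.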

Expected behavior

Consistency between the dataset loader and benchmark (or clear explanation in the documentation why this dataset behaves differently).
Alignment between benchmark documentation and the actual data returned.
No duplicated molecules with conflicting outputs in the test split.
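
The last expectation can be checked mechanically. Below is a minimal sketch, assuming each row can be reduced to a (molecule, label) pair; `find_conflicting_duplicates` is a hypothetical helper, not a TDC function, and the toy values are made up. With the real test split, the pairs could presumably be built with something like `zip(df["Drug"], df["Y"])`, assuming those column names.

```python
from collections import defaultdict


def find_conflicting_duplicates(records):
    """Return {molecule: sorted labels} for molecules that appear with more than one distinct label.

    Hypothetical helper for illustration; molecules are matched by exact string,
    so non-canonical SMILES for the same compound are not merged.
    """
    labels_by_molecule = defaultdict(set)
    for molecule, label in records:
        labels_by_molecule[molecule].add(float(label))
    return {
        molecule: sorted(labels)
        for molecule, labels in labels_by_molecule.items()
        if len(labels) > 1
    }


# Toy test split (made-up values): "CCO" has two different labels,
# "CCN" is duplicated but with the same label, so only "CCO" is a conflict.
toy_test_split = [
    ("CCO", 95.0),
    ("CCO", 80.5),
    ("c1ccccc1", 60.0),
    ("CCN", 42.0),
    ("CCN", 42.0),
]
conflicts = find_conflicting_duplicates(toy_test_split)
print(conflicts)  # {'CCO': [80.5, 95.0]}
```

An empty result on the benchmark test split would indicate that no molecule is listed with conflicting binding rates.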

Screenshots

Environment:

OS: Linux 6.14.10-arch1-1
Python version: 3.11.13
TDC version: 1.1.15

PPBR_AZ Benchmark and dataset mismatch #377
