Describe the bug
The PPBR_AZ dataset returns a different number of molecules depending on whether it is loaded via the admet_group benchmark or the ADME module. Additionally, the data loaded through admet_group does not match the documentation.
- `PPBR_AZ` from `ADME` has 1614 molecules (train: 1130, valid: 161, test: 323).
- According to the `admet_group` documentation, `PPBR_AZ` should contain 1797 molecules in total.
- When loaded via `admet_group`, the dataset contains 2790 molecules (2231 train_val, 559 test).
To Reproduce

    from tdc.benchmark_group import admet_group

    group = admet_group(path="data/")
    benchmark = group.get('PPBR_AZ')
    benchmark['train_val']  # <-- 2231 rows dataframe
    benchmark['test']       # <-- 559 rows, including duplicated molecules with different binding rates

    from tdc.single_pred import ADME

    data = ADME(name='PPBR_AZ')
    split = data.get_split()
    train_split = split['train']  # <-- 1130 molecules
    valid_split = split['valid']  # <-- 161 molecules
    test_split = split['test']    # <-- 323 molecules

Expected behavior
- Consistency between the dataset loader and benchmark (or clear explanation in the documentation why this dataset behaves differently).
- Alignment between benchmark documentation and the actual data returned.
- No duplicated molecules with conflicting outputs in the test split.
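
For reference, the last point above could be checked with something like the sketch below. It is not part of TDC: `find_conflicting_duplicates` is a hypothetical helper, and the toy rows are made-up values rather than data from PPBR_AZ. It groups labels by SMILES string and reports every molecule that has more than one distinct label.

```python
from collections import defaultdict


def find_conflicting_duplicates(rows):
    """Return {smiles: sorted labels} for molecules with more than one distinct label.

    `rows` is any iterable of (smiles, label) pairs. This is a hypothetical
    helper for illustration, not a TDC function.
    """
    labels_by_smiles = defaultdict(set)
    for smiles, label in rows:
        labels_by_smiles[smiles].add(label)
    # Exact duplicates (same SMILES, same label) collapse to one label and are not reported.
    return {s: sorted(labels) for s, labels in labels_by_smiles.items() if len(labels) > 1}


# Toy rows with made-up values (not taken from PPBR_AZ).
toy_rows = [
    ("CCO", 95.0),
    ("CCO", 80.5),       # same molecule, different label -> conflict
    ("c1ccccc1", 60.0),
    ("CCN", 70.0),
    ("CCN", 70.0),       # same molecule, same label -> plain duplicate, no conflict
]
conflicts = find_conflicting_duplicates(toy_rows)
print(conflicts)  # {'CCO': [80.5, 95.0]}
```

On the dataframes from the reproduction this could be called as `find_conflicting_duplicates(zip(benchmark['test']['Drug'], benchmark['test']['Y']))`, assuming the SMILES and label columns are named `Drug` and `Y`. The comparison is on the raw SMILES string, so the same molecule written with two different SMILES strings would not be caught without canonicalizing first.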
Environment:
- OS: Linux 6.14.10-arch1-1
- Python version: 3.11.13
- TDC version: 1.1.15