PPBR_AZ Benchmark and dataset mismatch #377

@Thematiq

Describe the bug

The PPBR_AZ dataset returns a different number of molecules depending on whether it is loaded via the admet_group benchmark or the ADME module. Additionally, the data loaded through admet_group does not match the documentation.

  • PPBR_AZ from ADME has 1614 molecules (train - 1130, valid - 161, test - 323).
  • According to the admet_group documentation, PPBR_AZ should contain 1797 molecules in total.
  • When loaded via admet_group, the dataset contains 2790 molecules (2231 train_val, 559 test).
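
As a quick sanity check on the numbers above, the split sizes can be summed and compared against the documented total. This is only arithmetic on the figures reported in this issue; it does not load any data, and the dictionary layout is just an illustrative choice.

```python
# Split sizes exactly as reported in this issue (no data is loaded here).
reported = {
    "ADME": {"train": 1130, "valid": 161, "test": 323},
    "admet_group": {"train_val": 2231, "test": 559},
}
documented_total = 1797  # total stated in the admet_group documentation

# Sum the splits per loader and show the gap to the documented total.
totals = {source: sum(splits.values()) for source, splits in reported.items()}
for source, total in totals.items():
    diff = total - documented_total
    print(f"{source}: {total} rows ({diff:+d} vs documented {documented_total})")
```

Neither loader matches the documented 1797: the ADME splits fall 183 short, while the admet_group splits exceed it by 993.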

To Reproduce

    from tdc.benchmark_group import admet_group

    group = admet_group(path="data/")
    benchmark = group.get('PPBR_AZ')
    benchmark['train_val']  # <-- 2231 rows dataframe
    benchmark['test']  # <-- 559 rows, including duplicated molecules with different binding rates

    from tdc.single_pred import ADME

    data = ADME(name='PPBR_AZ')
    train_split = data.get_split()['train']  # <-- 1130 molecules
    valid_split = data.get_split()['valid']  # <-- 161 molecules
    test_split = data.get_split()['test']  # <-- 323 molecules

Expected behavior

  • Consistency between the dataset loader and the benchmark (or a clear explanation in the documentation of why this dataset behaves differently).
  • Alignment between benchmark documentation and the actual data returned.
  • No duplicated molecules with conflicting outputs in the test split.
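
The last point can be checked with a small helper such as the one below. It is a minimal sketch, not part of TDC: the function name `find_conflicting_duplicates` and the toy rows are hypothetical, and it compares molecules by their raw string, so two different SMILES spellings of the same molecule would not be detected.

```python
from collections import defaultdict


def find_conflicting_duplicates(rows, tol=0.0):
    """Return {molecule: sorted labels} for molecules whose labels differ by more than tol.

    `rows` is any iterable of (molecule_string, label) pairs.
    """
    labels = defaultdict(set)
    for molecule, y in rows:
        labels[molecule].add(float(y))
    return {
        mol: sorted(ys)
        for mol, ys in labels.items()
        if max(ys) - min(ys) > tol
    }


# Toy rows standing in for a test split (made-up molecules and values).
toy_rows = [
    ("CCO", 95.0),
    ("CCO", 80.5),       # same molecule, different label: a conflict
    ("c1ccccc1", 50.0),
    ("c1ccccc1", 50.0),  # exact duplicate with the same label: not a conflict
    ("CCN", 12.3),
]
conflicts = find_conflicting_duplicates(toy_rows)
print(conflicts)  # {'CCO': [80.5, 95.0]}
```

Assuming the benchmark dataframe exposes the molecule and label in columns named `Drug` and `Y` (an assumption to verify against the actual dataframe), the helper could be applied as `find_conflicting_duplicates(zip(benchmark['test']['Drug'], benchmark['test']['Y']))`.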

Screenshots

Image

Environment:

  • OS: Linux 6.14.10-arch1-1
  • Python version: 3.11.13
  • TDC version: 1.1.15
