Skip to content

Conversation

@gustavocidornelas
Copy link
Contributor

Refactors the dataset validations and introduces dataset types (for training and validation sets).

Summary

  • Factored out the validations done for the datasets to the DatasetValidator class, analogous to the ModelValidator introduced in a previous PR.
  • Additionally, introduced the dataset_type argument, which can be either DatasetType.Training or DatasetType.Validation.

Additional comments

  • I'll finish updating the docstrings once we have the new commit-style in place.
  • I'm not yet fully satisfied with how we pass the args to the add_dataset. I introduced the possibility of passing it via a dataset_config.yaml file (analogous to the model_config.yaml for models), but it still feels a bit awkward. I'll continue thinking about it.
@linear
Copy link

linear bot commented Jan 19, 2023

OPEN-3366 Refactor dataset validations

Follow the same style as the new model package validations

file_path: str,
class_names: List[str],
label_column_name: str,
dataset_type: DatasetType = DatasetType.Validation,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wouldn't set a default for this to encourage users to think about it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point! I'll change it

dataset_type: DatasetType = DatasetType.Validation,
feature_names: List[str] = [],
text_column_name: Optional[str] = None,
categorical_feature_names: List[str] = [],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why'd we decide to not use a .yaml for this one but use it for model? bc it could be in memory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not final. I feel like these args are still a bit awkward. I added the support for a yaml for consistency with the model upload. Since there, the user is constructing a model package, it made sense to ask for files.

In the dataset case, the user is not preparing a package and, as you said, a lot of this info could be in memory (especially if they're using the add_dataframe method). I'm going to be re-thinking how to pass these args in the next few days.

language: str = "en",
sep: str = ",",
commit_message: Optional[str] = None,
dataset_config_file_path: Optional[str] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need the docstring for this prop

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, since this is prob. not be final, I'll update the docstrings later. I'll have to re-write the docstrings to follow the new commit-style anyway, so I'll make sure to update it all together. (Wrote down on my to-dos so I don't forget)

)
except ma.ValidationError as err:
# ---------------------------- Dataset validations --------------------------- #
# TODO: re-think the way the arguments are passed for the dataset upload
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah

@gustavocidornelas gustavocidornelas force-pushed the cid/refactor-dataset-validations branch from 76ffacd to 5a98e10 Compare January 20, 2023 11:18
@gustavocidornelas gustavocidornelas merged commit 5435781 into mvp Jan 21, 2023
@gustavocidornelas gustavocidornelas deleted the cid/refactor-dataset-validations branch January 21, 2023 14:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

3 participants