@falcaopetri
Contributor

What does this PR do?

This PR fixes the evaluation loop in run_mlm_flax_stream.py. The current code does not update the correct variable, which leads to data leakage during evaluation.

It also takes the opportunity to clean up some uses of DataTrainingArguments.


It's a draft PR because one further improvement remains open: the script splits train and eval data based solely on data_args.{dataset_name,num_eval_samples}, yet it also accepts the args train_file, validation_file, train_ref_file, validation_ref_file, and validation_split_percentage, none of which are used. Other unused data args are pad_to_max_length and line_by_line.
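
For illustration only, here is a minimal sketch of how a streamed dataset can be split by a sample count without leaking eval data into training. It is not the code from run_mlm_flax_stream.py: `split_stream` is a hypothetical helper, and the assumption that the first `num_eval_samples` items are the ones held out is mine.

```python
from itertools import islice


def split_stream(samples, num_eval_samples):
    """Hold out the first `num_eval_samples` items of a stream for evaluation.

    Hypothetical helper for illustration; not part of run_mlm_flax_stream.py.
    """
    stream = iter(samples)
    # Consume the held-out items from the shared iterator.
    eval_samples = list(islice(stream, num_eval_samples))
    # `stream` is now positioned after the held-out items, so anything
    # drawn from it for training cannot overlap with `eval_samples`.
    return eval_samples, stream


eval_samples, train_stream = split_stream(range(10), num_eval_samples=3)
train_samples = list(train_stream)
```

Because both parts are drawn from the same iterator, the split stays disjoint as long as the eval samples are collected once and reused, rather than re-read from a fresh stream at every evaluation step.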

My suggestion would be to remove all these unused args. May I proceed with that?

Before submitting

Who can review?

@patrickvonplaten

@github-actions
Contributor

github-actions bot commented Dec 3, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Dec 12, 2021
@falcaopetri falcaopetri marked this pull request as ready for review December 14, 2021 01:48
@github-actions github-actions bot closed this Dec 22, 2021
@patrickvonplaten
Contributor

@patil-suraj could you take a look here? :-)

@huggingface huggingface deleted a comment from github-actions bot Jan 16, 2022
@patil-suraj patil-suraj added the WIP label (for long-outstanding issues/PRs that are work in progress) Jan 16, 2022