@lcy-seso lcy-seso commented Jan 18, 2018

  • Use moses to tokenize the raw input data.
  • Need to add comments.

You can use this dataset like this:

train_data = paddle.batch(
    paddle.reader.shuffle(
        paddle.dataset.wmt16.train(
            src_dict_size=10000, trg_dict_size=10000, src_lang="en"),
        buf_size=1000),
    batch_size=batch_size)

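To make the composition in the snippet above concrete, here is a minimal pure-Python sketch of how buffer-based shuffling and batching wrappers of this kind can be layered. `toy_reader`, `shuffle`, and `batch` are hypothetical stand-ins written for illustration, not Paddle's implementation; the sketch assumes a reader is a zero-argument callable that returns an iterator over samples.

```python
import random


def toy_reader():
    """Hypothetical stand-in for a dataset reader: yields samples one by one."""
    for i in range(10):
        yield i


def shuffle(reader, buf_size, seed=None):
    """Wrap a reader so that samples are shuffled within a buffer of buf_size."""
    def shuffled_reader():
        rng = random.Random(seed)
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                # Buffer is full: shuffle it and emit its contents.
                rng.shuffle(buf)
                yield from buf
                buf = []
        if buf:
            # Emit whatever is left in the final, partially filled buffer.
            rng.shuffle(buf)
            yield from buf
    return shuffled_reader


def batch(reader, batch_size):
    """Wrap a reader so that it yields lists of up to batch_size samples."""
    def batched_reader():
        current = []
        for sample in reader():
            current.append(sample)
            if len(current) == batch_size:
                yield current
                current = []
        if current:
            # The last batch may be smaller than batch_size.
            yield current
    return batched_reader


# Same nesting order as in the snippet above: dataset -> shuffle -> batch.
train_data = batch(shuffle(toy_reader, buf_size=4, seed=0), batch_size=3)
batches = list(train_data())
```

In this sketch shuffling is only local: a sample can move within its buffer of `buf_size` items but never across buffers, so a larger buffer gives a more thorough shuffle at the cost of memory.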
@lcy-seso lcy-seso force-pushed the wmt16_en_ger branch 2 times, most recently from 6a387f6 to 0775d65 January 19, 2018 09:00
@lcy-seso lcy-seso requested a review from guoshengCS January 19, 2018 09:11

UNK_MARK = "<unk>"


def __build_dict__(tar_file, dict_size, save_path, lang):
Contributor

Be careful about the naming style: in Python, names of the form __XXX__ are conventionally reserved for special methods and attributes defined by the language.

Contributor Author

Yes, I agree. Do you think the functions named __xx__ in the other datasets (such as wmt14) should also be renamed to __xx?

Contributor

It seems that naming a function _xx is enough to mark it as private. I don't know whether __xx is a better naming style. It would be better to unify the naming style, but that is tedious work.

Contributor Author

Thank you. According to the Google Python style guide (https://google.github.io/styleguide/pyguide.html#Naming), I think __xx is OK. I will give it a try.

Contributor Author

Done.
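
For reference, the three naming forms discussed in this thread behave differently in Python. The sketch below uses a hypothetical class, `DatasetHelper`, that is not part of this PR. Name mangling of `__xx` applies only to names used inside a class body. For a module-level function such as the one under review, a leading underscore (single or double) mainly signals that the name is internal and keeps it out of `from module import *` when the module defines no `__all__`.

```python
class DatasetHelper:
    """Hypothetical class, used only to show how Python treats each naming form."""

    def _build_dict(self):
        # Single leading underscore: private by convention only.
        return "single leading underscore"

    def __build_dict(self):
        # Double leading underscore: mangled to _DatasetHelper__build_dict.
        return "double leading underscore"

    def __build_dict__(self):
        # Leading and trailing double underscores: not mangled; this form is
        # conventionally reserved for names defined by the language itself.
        return "dunder style"

    def call_private(self):
        # Inside the class body the unmangled spelling still works.
        return self.__build_dict()


helper = DatasetHelper()

# Collect the attribute names that actually exist on the instance.
names = [n for n in dir(helper) if "build_dict" in n]
```

Listing `names` shows that the `__build_dict` method is stored under the mangled name `_DatasetHelper__build_dict`, while `_build_dict` and `__build_dict__` keep their spelling.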

@guoshengCS guoshengCS left a comment

LGTM

@lcy-seso lcy-seso merged commit 430fdc5 into PaddlePaddle:develop Jan 22, 2018
@lcy-seso lcy-seso deleted the wmt16_en_ger branch January 22, 2018 06:25