Skip to content
This repository was archived by the owner on Jul 7, 2023. It is now read-only.

Conversation

@ricsinaruto
Copy link

For a while, I've been using tensor2tensor to do dialog modeling research, and I integrated some dialog problems in my own repo, but maybe others could also benefit from these if they were in the official tensor2tensor repo.

All problems are set up in a single-turn fashion without any additional inputs other than the utterances, so basically like a translate problem. I trained a transformer model on all problems and got decent results (see my repo for more details).

The PR contains 5 new files:

  • dialog_abstract: An abstract base class for all my other dialog problems, since a lot of functionality is shared. I believe this can also be used for other dialog datasets or problems. Subclasses should implement the preprocess_data function, which just sets up the names of data files and download url, and the create_data function, where the vocab file is built, the dataset is preprocessed and converted to be further processed by t2t-datagen.
  • dialog_cornell: Implements the Cornell Movie-Dialogs Corpus.
  • dialog_opensubtitles: Implements the Opensubtitles corpus. There are separate classes for different versions of this dataset (denoted by the year), e.g. dialog_opensubtitles64k2012.
  • dialog_dailydialog: Implements the DailyDialog dataset.
  • dialog_personachat: Implements the original Persona-Chat dataset.

All problems have basic preprocessing before data generation (lowering, etc.).
At the end of each registered problem name, I put the size of the vocab (e.g. dialog_dailydialog16k).

These problems depend on the clint package to visualize data downloading progress, should this be added to the setup.py script? (as I didn't see it there)

@googlebot googlebot added the cla: yes PR author has signed CLA label Jul 24, 2019
@ricsinaruto
Copy link
Author

ricsinaruto commented Sep 9, 2019

Can I get an update on this PR, what is required on my part to be able to merge it?

If there is no desire to add these problems to tensor2tensor, then let's close it, but I would like to get some feedback on what we should do with it because it just sits here since July. (@lukaszkaiser)

@afrozenator
Copy link
Contributor

Hi @ricsinaruto -- We're still working on our end to merge in this PR.

Why this is getting delayed:

  • the pylint we use internally doesn't like the formatting/doc format etc so that needs to be done.
    • You can run pylint using the pylintrc we're using by running pylint -j 16 tensor2tensor
  • internally we write .bazel build rules for new files added.
  • We'll probably skip clint since we don't want to add too many dependencies (should be a small change)

Really appreciate your contribution and sorry for the late update!

@ricsinaruto
Copy link
Author

Awesome, thanks for the clarification! Let me know if I can do anything 😄

@afrozenator
Copy link
Contributor

The only thing you can probably do, and would be a great help for me, is to run pylint with our pylintrc file https://github.com/tensorflow/tensor2tensor/blob/master/pylintrc on the newly added files and fix those and push it?

The other thing I forgot to write (and I'm not sure applies here) is that for a brief bit of time we need to be both py2 and py3 compliant.

@ricsinaruto
Copy link
Author

I think I fixed everything

@afrozenator
Copy link
Contributor

afrozenator commented Sep 10, 2019 via email

@afrozenator
Copy link
Contributor

Ok, giving this another shot, but can you resolve the conflicts with head?

@afrozenator
Copy link
Contributor

One further request, can you format docstrings like so:

 def preprocess_example(self, example, mode, hparams): """Runtime preprocessing. Return a dict or a tf.data.Dataset.from_tensor_slices (if you want each example to turn into multiple). Args: example: dict, features mode: tf.estimator.ModeKeys hparams: HParams, model hyperparameters Returns: dict or Dataset """ 

Some notes:

  • Docstring should start with """ [g-docstring-quotes]
  • One-line docstring summary should be followed by a blank line [g-no-space-after-docstring-summary]
  • Also the formatting of the args.

Let me know if you can do this?

Thanks again for the addition!

@afrozenator
Copy link
Contributor

I'm surprised our pylintrc didn't report these -- are you sure it was picked up?

@ricsinaruto
Copy link
Author

I'll resolve conflicts and reformat the docstrings as soon as possible. I ran the pylintrc that you linked, and it said everything was okay after the fixes that I committed, so no idea.

Richárd Krisztián Csáky added 3 commits September 12, 2019 19:15
@ricsinaruto
Copy link
Author

Travis doesn't seem to fail due to my code, I'm not entirely sure.

@afrozenator
Copy link
Contributor

afrozenator commented Sep 12, 2019 via email

@ricsinaruto
Copy link
Author

ricsinaruto commented Sep 12, 2019

Thanks! How can I get these though?
I ran the following command:
pylint --rcfile=pylintrc "path-to-files"

@afrozenator
Copy link
Contributor

afrozenator commented Sep 12, 2019 via email

@ricsinaruto
Copy link
Author

image
I really don't understand what I'm missing here :(

@ricsinaruto
Copy link
Author

I tried to fix most of the things that you pasted above.

@afrozenator
Copy link
Contributor

afrozenator commented Sep 13, 2019 via email

@afrozenator
Copy link
Contributor

Thanks a lot @ricsinaruto ! That really fixed almost everything, now running the tests internally and will merge in shortly -- many thanks for patiently working on fixing everything -- and a huge thanks for contributing the problems in the first place!

@ricsinaruto
Copy link
Author

Awesome, I'm happy to contribute, and that they will finally make a part of tensor2tensor :)

@afrozenator afrozenator merged commit 9f87f15 into tensorflow:master Sep 13, 2019
@afrozenator
Copy link
Contributor

Done! Thanks a lot for the work!!

tensorflow-copybara pushed a commit that referenced this pull request Sep 13, 2019
PiperOrigin-RevId: 268923712
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

cla: yes PR author has signed CLA

3 participants