Skip to content
This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Conversation

@dsmilkov
Copy link
Contributor

@dsmilkov dsmilkov commented Jun 18, 2019

Allow serialization and deserialization of string weights in the binary weights format.

This is needed to enable execution of AutoML models, which store the vocab as a weight (Const of dtype string).

Corresponding change in tfjs-core: tensorflow/tfjs-core#1816

Fixes tensorflow/tfjs#1598


This change is Reviewable

@dsmilkov
Copy link
Contributor Author

CLA is ok since I'm a Googler (the confusion is because I didn't use email address for the first 3 commits - they were made from a new laptop)

@dsmilkov dsmilkov changed the title WIP Add support for string weights Add support for string weights Jun 18, 2019
@dsmilkov dsmilkov requested review from caisq and pyu10055 June 18, 2019 13:38
@tensorflow tensorflow deleted a comment from googlebot Jun 18, 2019
@tensorflow tensorflow deleted a comment from googlebot Jun 18, 2019
dsmilkov added a commit to tensorflow/tfjs-core that referenced this pull request Jun 19, 2019
FEATURE - Add support for deserializing weights of dtype `string` - Allow string tensors in `tile` and `expandDims` ops (these ops are used by AutoML) - Add utf-8 encoding/decoding capability to `Platform` This is needed to enable execution of AutoML models, which store the vocab as a weight (`Const` tensor) of dtype `string`. Corresponding change in tfjs-converter: tensorflow/tfjs-converter#386
@dsmilkov
Copy link
Contributor Author

@pyu10055 This is ready for review. Added the chardet library to detect the encoding and unit tests with np.arrays with different encodings.

@tensorflow tensorflow deleted a comment from googlebot Jun 20, 2019
@googlebot
Copy link

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

Copy link
Contributor

@caisq caisq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)


python/tensorflowjs/read_weights.py, line 149 at r4 (raw file):

 np.object: 

Can we check weight['dtype'] instead? np.object is an implementation detail.


python/tensorflowjs/read_weights_test.py, line 65 at r4 (raw file):

[['test', 'a'], ['b', 'c']], 

Same as below. Add test for empty strings and empty string tensors (if supported).


python/tensorflowjs/read_weights_test.py, line 162 at r4 (raw file):

'test' 

Can you add tests for 1) empty strings in a non-empty string tensor and 2) empty string tensors? Is 2 supported?


python/tensorflowjs/write_weights.py, line 181 at r4 (raw file):

def _serialize_string_array(data, delimiter): 

Can you add a doc string to document what the types the arguments are expected to be? Also document the return type. Same for the function below. A


python/tensorflowjs/write_weights.py, line 193 at r4 (raw file):

unicode_strings 

Variable name nit: --> decoded_strings


python/tensorflowjs/write_weights.py, line 197 at r4 (raw file):

delimiter.join(unicode_strings) 

Should we check if unicode_strings contain delimiter and throw an error if that's the case?


python/tensorflowjs/write_weights.py, line 295 at r4 (raw file):

STRING_DELIMITER 

So we are not opening the option for the user to specify the delimiter yet?


python/tensorflowjs/converters/tf_saved_model_conversion_v2.py, line 197 at r4 (raw file):

can_skip_weight = (skip_op_check and 

Why can this code be deleted without breaking any features?

Copy link
Contributor Author

@dsmilkov dsmilkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the good review! PTAL.

Reviewable status: 0 of 1 approvals obtained (waiting on @caisq and @pyu10055)


python/tensorflowjs/read_weights.py, line 149 at r4 (raw file):

Previously, caisq (Shanqing Cai) wrote…
 np.object: 

Can we check weight['dtype'] instead? np.object is an implementation detail.

Done.


python/tensorflowjs/read_weights_test.py, line 65 at r4 (raw file):

Previously, caisq (Shanqing Cai) wrote…
[['test', 'a'], ['b', 'c']], 

Same as below. Add test for empty strings and empty string tensors (if supported).

Done. Added a test below.


python/tensorflowjs/read_weights_test.py, line 162 at r4 (raw file):

Previously, caisq (Shanqing Cai) wrote…
'test' 

Can you add tests for 1) empty strings in a non-empty string tensor and 2) empty string tensors? Is 2 supported?

Done.


python/tensorflowjs/write_weights.py, line 181 at r4 (raw file):

Previously, caisq (Shanqing Cai) wrote…
def _serialize_string_array(data, delimiter): 

Can you add a doc string to document what the types the arguments are expected to be? Also document the return type. Same for the function below. A

Done.


python/tensorflowjs/write_weights.py, line 193 at r4 (raw file):

Previously, caisq (Shanqing Cai) wrote…
unicode_strings 

Variable name nit: --> decoded_strings

Done.


python/tensorflowjs/write_weights.py, line 197 at r4 (raw file):

Previously, caisq (Shanqing Cai) wrote…
delimiter.join(unicode_strings) 

Should we check if unicode_strings contain delimiter and throw an error if that's the case?

Done. Added a unit test.


python/tensorflowjs/write_weights.py, line 295 at r4 (raw file):

Previously, caisq (Shanqing Cai) wrote…
STRING_DELIMITER 

So we are not opening the option for the user to specify the delimiter yet?

Correct. Hopefully there is no need to change this delimiter ever, but it's good to have the escape hatch for later.


python/tensorflowjs/converters/tf_saved_model_conversion_v2.py, line 197 at r4 (raw file):

Previously, caisq (Shanqing Cai) wrote…
can_skip_weight = (skip_op_check and 

Why can this code be deleted without breaking any features?

Because we used to skip adding string weights in the manifest, and now we don't need to skip them anymore since we support them.

Copy link
Contributor

@pyu10055 pyu10055 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 5 files at r1, 1 of 5 files at r3.
Reviewable status: 0 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)


python/tensorflowjs/read_weights.py, line 27 at r5 (raw file):

object 

should be store the encoding of the string tensor, so after conversion and loaded on the js side, it can be encoded to the correct encoding?

Copy link
Contributor

@caisq caisq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @caisq and @dsmilkov)


python/tensorflowjs/converters/tf_saved_model_conversion_v2.py, line 197 at r4 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

Because we used to skip adding string weights in the manifest, and now we don't need to skip them anymore since we support them.

Ah - OK. I should have realized that.

Copy link
Contributor Author

@dsmilkov dsmilkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)


python/tensorflowjs/read_weights.py, line 27 at r5 (raw file):

Previously, pyu10055 (Ping Yu) wrote…
object 

should be store the encoding of the string tensor, so after conversion and loaded on the js side, it can be encoded to the correct encoding?

Good question, but I didn't see any specific need for that since utf-8 can encode any unicode code point. Using a single type of encoding (utf-8) on the serving side simplifies deployment beyond the browser (wechat, react native, any future platforms)

Copy link
Contributor Author

@dsmilkov dsmilkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL. I changed the string serialization format to align with TF C++. We now store the bytes of the string with 4 bytes denoting the string length.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @caisq and @pyu10055)

Copy link
Contributor

@caisq caisq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)


python/tensorflowjs/read_weights.py, line 76 at r6 (raw file):

32-bit integer 

This is surprising. Why not a uint64? Is it because of JavaScript limitation?

Keep in mind that the format will likely get used as a shared format between Keras and TF.js, so to the degree possible, we should remove JavaScript-specific workarounds from it.


python/tensorflowjs/read_weights.py, line 82 at r6 (raw file):

[byte length of s1][bytes of s1...][byte length of s2][bytes of s2...] 

Does bytes of s* ... include the delimiter?


python/tensorflowjs/read_weights.py, line 92 at r6 (raw file):

A np.array o 

offset is not documented here.

Copy link
Contributor Author

@dsmilkov dsmilkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @caisq and @pyu10055)


python/tensorflowjs/read_weights.py, line 76 at r6 (raw file):

Previously, caisq (Shanqing Cai) wrote…
32-bit integer 

This is surprising. Why not a uint64? Is it because of JavaScript limitation?

Keep in mind that the format will likely get used as a shared format between Keras and TF.js, so to the degree possible, we should remove JavaScript-specific workarounds from it.

Talked offline. Yes, we can't represent uint64 in JS natively unless we implement our own IEEE-754. Changed to uint32 which is natively supported in JS, and gives us strings with up to 4B characters of length (4GB of ascii text).


python/tensorflowjs/read_weights.py, line 82 at r6 (raw file):

Previously, caisq (Shanqing Cai) wrote…
[byte length of s1][bytes of s1...][byte length of s2][bytes of s2...] 

Does bytes of s* ... include the delimiter?

There is no longer a delimiter since the size of the string is explained by the byte length.


python/tensorflowjs/read_weights.py, line 92 at r6 (raw file):

Previously, caisq (Shanqing Cai) wrote…
A np.array o 

offset is not documented here.

Thanks! Added.

Copy link
Contributor Author

@dsmilkov dsmilkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL, changed to uint32. In tfjs-core, I will document the protocol in the WeightsManifest interface.

Reviewable status: 0 of 1 approvals obtained (waiting on @caisq and @pyu10055)

Copy link
Contributor

@caisq caisq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)


python/tensorflowjs/read_weights.py, line 76 at r6 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

Talked offline. Yes, we can't represent uint64 in JS natively unless we implement our own IEEE-754. Changed to uint32 which is natively supported in JS, and gives us strings with up to 4B characters of length (4GB of ascii text).

SGTM.


python/tensorflowjs/read_weights.py, line 82 at r6 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

There is no longer a delimiter since the size of the string is explained by the byte length.

OK. Thanks for the explanation.


python/tensorflowjs/write_weights.py, line 24 at r7 (raw file):

from tensorflowjs.read_weights import STRING_LENGTH_DTYPE 

As per Google Python style guide, import the module instead, i.e.,

from tensorflowjs import read_weights ... read_weights.STRING_LENGTH_DTYPE ...

See https://github.com/google/styleguide/blob/gh-pages/pyguide.md#224-decision


python/tensorflowjs/write_weights_test.py, line 132 at r7 (raw file):

 

style nit: In-line comments should follow two spaces, not one. Same else where.

Copy link
Contributor

@pyu10055 pyu10055 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

Reviewable status: :shipit: complete! 2 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)

Copy link
Contributor Author

@dsmilkov dsmilkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reviews!

Reviewable status: :shipit: complete! 2 of 1 approvals obtained (waiting on @caisq and @pyu10055)


python/tensorflowjs/write_weights.py, line 24 at r7 (raw file):

Previously, caisq (Shanqing Cai) wrote…
from tensorflowjs.read_weights import STRING_LENGTH_DTYPE 

As per Google Python style guide, import the module instead, i.e.,

from tensorflowjs import read_weights ... read_weights.STRING_LENGTH_DTYPE ...

See https://github.com/google/styleguide/blob/gh-pages/pyguide.md#224-decision

Done.


python/tensorflowjs/write_weights_test.py, line 132 at r7 (raw file):

Previously, caisq (Shanqing Cai) wrote…
 

style nit: In-line comments should follow two spaces, not one. Same else where.

Done.

@dsmilkov dsmilkov merged commit 7ab02c7 into master Jun 27, 2019
@dsmilkov dsmilkov deleted the string-weights branch June 27, 2019 03:00
dsmilkov added a commit to tensorflow/tfjs-core that referenced this pull request Jun 27, 2019
FEATURE To align with TensorFlow Python/C++, this PR changes the way we serialize strings in the weights format, and in our engine. - We store the underlying encoded string bytes as `Uint8Array`. Thus a string tensors (which has multiple strings) is backed by `Uint8Array[]`. - To keep backwards compatibility, `tensor.data()` returns `string[]`, which means that we try to utf-8 decode a string. - In the weights format, string bytes are kept unchanged, with their original encoding. Each string is prefixed with 4 bytes denoting the number of bytes in the string. Thus, a string tensor of 3 values will be encoded as `[4 bytes][string1...][4 bytes][string2...][4 bytes][string3....]` - Add `util.encodeString(text: string, encoding?: string)` and `util.decodeString(bytes: Uint8Array, encoding?: string)`, along with the respective `Platform` methods - Add `tensor.bytes()` which gives the underlying bytes of the data - `Uint8Array` for any numeric tensor, and `Uint8Array[]` for string tensors. Corresponding change in tfjs-converter: tensorflow/tfjs-converter#386
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

5 participants