Add support for string weights #386

dsmilkov · 2019-06-18T13:35:14Z

Allow serialization and deserialization of string weights in the binary weights format.

This is needed to enable execution of AutoML models, which store the vocab as a weight (Const of dtype string).

Corresponding change in tfjs-core: tensorflow/tfjs-core#1816

This change is

dsmilkov · 2019-06-18T13:36:05Z

CLA is ok since I'm a Googler (the confusion is because I didn't use email address for the first 3 commits - they were made from a new laptop)

FEATURE - Add support for deserializing weights of dtype `string` - Allow string tensors in `tile` and `expandDims` ops (these ops are used by AutoML) - Add utf-8 encoding/decoding capability to `Platform` This is needed to enable execution of AutoML models, which store the vocab as a weight (`Const` tensor) of dtype `string`. Corresponding change in tfjs-converter: tensorflow/tfjs-converter#386

dsmilkov · 2019-06-20T12:36:36Z

@pyu10055 This is ready for review. Added the chardet library to detect the encoding and unit tests with np.arrays with different encodings.

googlebot · 2019-06-20T18:02:12Z

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

caisq

Reviewable status: 0 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)

python/tensorflowjs/read_weights.py, line 149 at r4 (raw file):

 np.object:

Can we check weight['dtype'] instead? np.object is an implementation detail.

python/tensorflowjs/read_weights_test.py, line 65 at r4 (raw file):

[['test', 'a'], ['b', 'c']],

Same as below. Add test for empty strings and empty string tensors (if supported).

python/tensorflowjs/read_weights_test.py, line 162 at r4 (raw file):

'test'

Can you add tests for 1) empty strings in a non-empty string tensor and 2) empty string tensors? Is 2 supported?

python/tensorflowjs/write_weights.py, line 181 at r4 (raw file):

def _serialize_string_array(data, delimiter):

Can you add a doc string to document what the types the arguments are expected to be? Also document the return type. Same for the function below. A

python/tensorflowjs/write_weights.py, line 193 at r4 (raw file):

unicode_strings

Variable name nit: --> decoded_strings

python/tensorflowjs/write_weights.py, line 197 at r4 (raw file):

delimiter.join(unicode_strings)

Should we check if unicode_strings contain delimiter and throw an error if that's the case?

python/tensorflowjs/write_weights.py, line 295 at r4 (raw file):

STRING_DELIMITER

So we are not opening the option for the user to specify the delimiter yet?

python/tensorflowjs/converters/tf_saved_model_conversion_v2.py, line 197 at r4 (raw file):

can_skip_weight = (skip_op_check and

Why can this code be deleted without breaking any features?

dsmilkov

Thanks for the good review! PTAL.

Reviewable status: 0 of 1 approvals obtained (waiting on @caisq and @pyu10055)

python/tensorflowjs/read_weights.py, line 149 at r4 (raw file):

Previously, caisq (Shanqing Cai) wrote…

 np.object: 
Can we check weight['dtype'] instead? np.object is an implementation detail.

Done.

python/tensorflowjs/read_weights_test.py, line 65 at r4 (raw file):

Previously, caisq (Shanqing Cai) wrote…

[['test', 'a'], ['b', 'c']], 
Same as below. Add test for empty strings and empty string tensors (if supported).

Done. Added a test below.

python/tensorflowjs/read_weights_test.py, line 162 at r4 (raw file):

Previously, caisq (Shanqing Cai) wrote…

'test' 
Can you add tests for 1) empty strings in a non-empty string tensor and 2) empty string tensors? Is 2 supported?

Done.

python/tensorflowjs/write_weights.py, line 181 at r4 (raw file):

Previously, caisq (Shanqing Cai) wrote…

def _serialize_string_array(data, delimiter): 
Can you add a doc string to document what the types the arguments are expected to be? Also document the return type. Same for the function below. A

Done.

python/tensorflowjs/write_weights.py, line 193 at r4 (raw file):

Previously, caisq (Shanqing Cai) wrote…

unicode_strings 
Variable name nit: --> decoded_strings

Done.

python/tensorflowjs/write_weights.py, line 197 at r4 (raw file):

Previously, caisq (Shanqing Cai) wrote…

delimiter.join(unicode_strings) 
Should we check if unicode_strings contain delimiter and throw an error if that's the case?

Done. Added a unit test.

python/tensorflowjs/write_weights.py, line 295 at r4 (raw file):

Previously, caisq (Shanqing Cai) wrote…

STRING_DELIMITER 
So we are not opening the option for the user to specify the delimiter yet?

Correct. Hopefully there is no need to change this delimiter ever, but it's good to have the escape hatch for later.

python/tensorflowjs/converters/tf_saved_model_conversion_v2.py, line 197 at r4 (raw file):

Previously, caisq (Shanqing Cai) wrote…

can_skip_weight = (skip_op_check and 
Why can this code be deleted without breaking any features?

Because we used to skip adding string weights in the manifest, and now we don't need to skip them anymore since we support them.

pyu10055

Reviewed 1 of 5 files at r1, 1 of 5 files at r3.
Reviewable status: 0 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)

python/tensorflowjs/read_weights.py, line 27 at r5 (raw file):

object

should be store the encoding of the string tensor, so after conversion and loaded on the js side, it can be encoded to the correct encoding?

caisq

Reviewable status: complete! 1 of 1 approvals obtained (waiting on @caisq and @dsmilkov)

python/tensorflowjs/converters/tf_saved_model_conversion_v2.py, line 197 at r4 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

Because we used to skip adding string weights in the manifest, and now we don't need to skip them anymore since we support them.

Ah - OK. I should have realized that.

dsmilkov

Reviewable status: complete! 1 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)

python/tensorflowjs/read_weights.py, line 27 at r5 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

object 
should be store the encoding of the string tensor, so after conversion and loaded on the js side, it can be encoded to the correct encoding?

Good question, but I didn't see any specific need for that since utf-8 can encode any unicode code point. Using a single type of encoding (utf-8) on the serving side simplifies deployment beyond the browser (wechat, react native, any future platforms)

dsmilkov

PTAL. I changed the string serialization format to align with TF C++. We now store the bytes of the string with 4 bytes denoting the string length.

Reviewable status: complete! 1 of 1 approvals obtained (waiting on @caisq and @pyu10055)

caisq

Reviewable status: 0 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)

python/tensorflowjs/read_weights.py, line 76 at r6 (raw file):

32-bit integer

This is surprising. Why not a uint64? Is it because of JavaScript limitation?

Keep in mind that the format will likely get used as a shared format between Keras and TF.js, so to the degree possible, we should remove JavaScript-specific workarounds from it.

python/tensorflowjs/read_weights.py, line 82 at r6 (raw file):

[byte length of s1][bytes of s1...][byte length of s2][bytes of s2...]

Does bytes of s* ... include the delimiter?

python/tensorflowjs/read_weights.py, line 92 at r6 (raw file):

A np.array o

offset is not documented here.

dsmilkov

Reviewable status: 0 of 1 approvals obtained (waiting on @caisq and @pyu10055)

python/tensorflowjs/read_weights.py, line 76 at r6 (raw file):

Previously, caisq (Shanqing Cai) wrote…

32-bit integer 
This is surprising. Why not a uint64? Is it because of JavaScript limitation?

Keep in mind that the format will likely get used as a shared format between Keras and TF.js, so to the degree possible, we should remove JavaScript-specific workarounds from it.

Talked offline. Yes, we can't represent uint64 in JS natively unless we implement our own IEEE-754. Changed to uint32 which is natively supported in JS, and gives us strings with up to 4B characters of length (4GB of ascii text).

python/tensorflowjs/read_weights.py, line 82 at r6 (raw file):

Previously, caisq (Shanqing Cai) wrote…

[byte length of s1][bytes of s1...][byte length of s2][bytes of s2...] 
Does bytes of s* ... include the delimiter?

There is no longer a delimiter since the size of the string is explained by the byte length.

python/tensorflowjs/read_weights.py, line 92 at r6 (raw file):

Previously, caisq (Shanqing Cai) wrote…

A np.array o 
offset is not documented here.

Thanks! Added.

dsmilkov

PTAL, changed to uint32. In tfjs-core, I will document the protocol in the WeightsManifest interface.

Reviewable status: 0 of 1 approvals obtained (waiting on @caisq and @pyu10055)

caisq

Reviewable status: complete! 1 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)

python/tensorflowjs/read_weights.py, line 76 at r6 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

Talked offline. Yes, we can't represent uint64 in JS natively unless we implement our own IEEE-754. Changed to uint32 which is natively supported in JS, and gives us strings with up to 4B characters of length (4GB of ascii text).

SGTM.

python/tensorflowjs/read_weights.py, line 82 at r6 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

There is no longer a delimiter since the size of the string is explained by the byte length.

OK. Thanks for the explanation.

python/tensorflowjs/write_weights.py, line 24 at r7 (raw file):

from tensorflowjs.read_weights import STRING_LENGTH_DTYPE

As per Google Python style guide, import the module instead, i.e.,

from tensorflowjs import read_weights ... read_weights.STRING_LENGTH_DTYPE ...

See https://github.com/google/styleguide/blob/gh-pages/pyguide.md#224-decision

python/tensorflowjs/write_weights_test.py, line 132 at r7 (raw file):

 

style nit: In-line comments should follow two spaces, not one. Same else where.

pyu10055

Thanks

Reviewable status: complete! 2 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)

dsmilkov

Thanks for the reviews!

Reviewable status: complete! 2 of 1 approvals obtained (waiting on @caisq and @pyu10055)

python/tensorflowjs/write_weights.py, line 24 at r7 (raw file):

Previously, caisq (Shanqing Cai) wrote…

from tensorflowjs.read_weights import STRING_LENGTH_DTYPE 
As per Google Python style guide, import the module instead, i.e.,
from tensorflowjs import read_weights ... read_weights.STRING_LENGTH_DTYPE ...
See https://github.com/google/styleguide/blob/gh-pages/pyguide.md#224-decision

Done.

python/tensorflowjs/write_weights_test.py, line 132 at r7 (raw file):

Previously, caisq (Shanqing Cai) wrote…

 
style nit: In-line comments should follow two spaces, not one. Same else where.

Done.

FEATURE To align with TensorFlow Python/C++, this PR changes the way we serialize strings in the weights format, and in our engine. - We store the underlying encoded string bytes as `Uint8Array`. Thus a string tensors (which has multiple strings) is backed by `Uint8Array[]`. - To keep backwards compatibility, `tensor.data()` returns `string[]`, which means that we try to utf-8 decode a string. - In the weights format, string bytes are kept unchanged, with their original encoding. Each string is prefixed with 4 bytes denoting the number of bytes in the string. Thus, a string tensor of 3 values will be encoded as `[4 bytes][string1...][4 bytes][string2...][4 bytes][string3....]` - Add `util.encodeString(text: string, encoding?: string)` and `util.decodeString(bytes: Uint8Array, encoding?: string)`, along with the respective `Platform` methods - Add `tensor.bytes()` which gives the underlying bytes of the data - `Uint8Array` for any numeric tensor, and `Uint8Array[]` for string tensors. Corresponding change in tfjs-converter: tensorflow/tfjs-converter#386

Daniel Smilkov and others added 9 commits May 27, 2019 11:21

save

1634d0f

save

9014818

save

dc4946b

empty

720c4ea

empty

a01e081

empty

ace7496

empty

c2c9ca1

Merge remote-tracking branch 'origin' into string-weights

a3e9d1d

save

0a88f3f

googlebot added the cla: no label Jun 18, 2019

dsmilkov changed the title ~~WIP Add support for string weights~~ Add support for string weights Jun 18, 2019

dsmilkov requested review from caisq and pyu10055 June 18, 2019 13:38

tensorflow deleted a comment from googlebot Jun 18, 2019

dsmilkov mentioned this pull request Jun 18, 2019

Add support for string weights tensorflow/tfjs-core#1795

Merged

save

4e9c345

tensorflow deleted a comment from googlebot Jun 18, 2019

save

610a4e9

dsmilkov added 3 commits June 20, 2019 08:00

save

045cf0d

save

2042874

save

9484145

tensorflow deleted a comment from googlebot Jun 20, 2019

savetensorflow-hub

3e1853a

caisq suggested changes Jun 20, 2019

View reviewed changes

dsmilkov added 2 commits June 20, 2019 17:40

save

7cdc512

save

7636735

dsmilkov commented Jun 20, 2019

View reviewed changes

pyu10055 suggested changes Jun 20, 2019

View reviewed changes

caisq approved these changes Jun 21, 2019

View reviewed changes

dsmilkov commented Jun 21, 2019

View reviewed changes

dsmilkov added 2 commits June 25, 2019 22:19

save

f0c7f43

save

064b029

dsmilkov commented Jun 26, 2019

View reviewed changes

dsmilkov mentioned this pull request Jun 26, 2019

Serialize string tensors as encoded (raw bytes) tensorflow/tfjs-core#1816

Merged

save

81b886c

caisq suggested changes Jun 26, 2019

View reviewed changes

dsmilkov commented Jun 26, 2019

View reviewed changes

dsmilkov added 2 commits June 26, 2019 13:56

save

a2ff7d9

save

ceffa2b

caisq approved these changes Jun 26, 2019

View reviewed changes

pyu10055 approved these changes Jun 26, 2019

View reviewed changes

save

606bf70

dsmilkov commented Jun 27, 2019

View reviewed changes

dsmilkov merged commit 7ab02c7 into master Jun 27, 2019

dsmilkov deleted the string-weights branch June 27, 2019 03:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add support for string weights #386

Add support for string weights #386

Uh oh!

dsmilkov commented Jun 18, 2019 •

edited

Loading

dsmilkov commented Jun 18, 2019

dsmilkov commented Jun 20, 2019

googlebot commented Jun 20, 2019

caisq left a comment

dsmilkov left a comment

pyu10055 left a comment

caisq left a comment

dsmilkov left a comment

dsmilkov left a comment

caisq left a comment

dsmilkov left a comment

dsmilkov left a comment

caisq left a comment

pyu10055 left a comment

dsmilkov left a comment

Labels

5 participants

Add support for string weights #386

Add support for string weights #386

Uh oh!

Conversation

dsmilkov commented Jun 18, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

dsmilkov commented Jun 18, 2019

dsmilkov commented Jun 20, 2019

googlebot commented Jun 20, 2019

caisq left a comment

Choose a reason for hiding this comment

dsmilkov left a comment

Choose a reason for hiding this comment

pyu10055 left a comment

Choose a reason for hiding this comment

caisq left a comment

Choose a reason for hiding this comment

dsmilkov left a comment

Choose a reason for hiding this comment

dsmilkov left a comment

Choose a reason for hiding this comment

caisq left a comment

Choose a reason for hiding this comment

dsmilkov left a comment

Choose a reason for hiding this comment

dsmilkov left a comment

Choose a reason for hiding this comment

caisq left a comment

Choose a reason for hiding this comment

pyu10055 left a comment

Choose a reason for hiding this comment

dsmilkov left a comment

Choose a reason for hiding this comment

Labels

5 participants

dsmilkov commented Jun 18, 2019 •

edited

Loading