- Notifications
You must be signed in to change notification settings - Fork 143
Add support for string weights #386
Conversation
| CLA is ok since I'm a Googler (the confusion is because I didn't use email address for the first 3 commits - they were made from a new laptop) |
FEATURE - Add support for deserializing weights of dtype `string` - Allow string tensors in `tile` and `expandDims` ops (these ops are used by AutoML) - Add utf-8 encoding/decoding capability to `Platform` This is needed to enable execution of AutoML models, which store the vocab as a weight (`Const` tensor) of dtype `string`. Corresponding change in tfjs-converter: tensorflow/tfjs-converter#386
| @pyu10055 This is ready for review. Added the |
| We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. ℹ️ Googlers: Go here for more info. |
caisq left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)
python/tensorflowjs/read_weights.py, line 149 at r4 (raw file):
np.object:
Can we check weight['dtype'] instead? np.object is an implementation detail.
python/tensorflowjs/read_weights_test.py, line 65 at r4 (raw file):
[['test', 'a'], ['b', 'c']],
Same as below. Add test for empty strings and empty string tensors (if supported).
python/tensorflowjs/read_weights_test.py, line 162 at r4 (raw file):
'test'
Can you add tests for 1) empty strings in a non-empty string tensor and 2) empty string tensors? Is 2 supported?
python/tensorflowjs/write_weights.py, line 181 at r4 (raw file):
def _serialize_string_array(data, delimiter):
Can you add a doc string to document what the types the arguments are expected to be? Also document the return type. Same for the function below. A
python/tensorflowjs/write_weights.py, line 193 at r4 (raw file):
unicode_strings
Variable name nit: --> decoded_strings
python/tensorflowjs/write_weights.py, line 197 at r4 (raw file):
delimiter.join(unicode_strings)
Should we check if unicode_strings contain delimiter and throw an error if that's the case?
python/tensorflowjs/write_weights.py, line 295 at r4 (raw file):
STRING_DELIMITER
So we are not opening the option for the user to specify the delimiter yet?
python/tensorflowjs/converters/tf_saved_model_conversion_v2.py, line 197 at r4 (raw file):
can_skip_weight = (skip_op_check and
Why can this code be deleted without breaking any features?
dsmilkov left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the good review! PTAL.
Reviewable status: 0 of 1 approvals obtained (waiting on @caisq and @pyu10055)
python/tensorflowjs/read_weights.py, line 149 at r4 (raw file):
Previously, caisq (Shanqing Cai) wrote…
np.object:Can we check weight['dtype'] instead? np.object is an implementation detail.
Done.
python/tensorflowjs/read_weights_test.py, line 65 at r4 (raw file):
Previously, caisq (Shanqing Cai) wrote…
[['test', 'a'], ['b', 'c']],Same as below. Add test for empty strings and empty string tensors (if supported).
Done. Added a test below.
python/tensorflowjs/read_weights_test.py, line 162 at r4 (raw file):
Previously, caisq (Shanqing Cai) wrote…
'test'Can you add tests for 1) empty strings in a non-empty string tensor and 2) empty string tensors? Is 2 supported?
Done.
python/tensorflowjs/write_weights.py, line 181 at r4 (raw file):
Previously, caisq (Shanqing Cai) wrote…
def _serialize_string_array(data, delimiter):Can you add a doc string to document what the types the arguments are expected to be? Also document the return type. Same for the function below. A
Done.
python/tensorflowjs/write_weights.py, line 193 at r4 (raw file):
Previously, caisq (Shanqing Cai) wrote…
unicode_stringsVariable name nit: --> decoded_strings
Done.
python/tensorflowjs/write_weights.py, line 197 at r4 (raw file):
Previously, caisq (Shanqing Cai) wrote…
delimiter.join(unicode_strings)Should we check if unicode_strings contain delimiter and throw an error if that's the case?
Done. Added a unit test.
python/tensorflowjs/write_weights.py, line 295 at r4 (raw file):
Previously, caisq (Shanqing Cai) wrote…
STRING_DELIMITERSo we are not opening the option for the user to specify the delimiter yet?
Correct. Hopefully there is no need to change this delimiter ever, but it's good to have the escape hatch for later.
python/tensorflowjs/converters/tf_saved_model_conversion_v2.py, line 197 at r4 (raw file):
Previously, caisq (Shanqing Cai) wrote…
can_skip_weight = (skip_op_check andWhy can this code be deleted without breaking any features?
Because we used to skip adding string weights in the manifest, and now we don't need to skip them anymore since we support them.
pyu10055 left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 5 files at r1, 1 of 5 files at r3.
Reviewable status: 0 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)
python/tensorflowjs/read_weights.py, line 27 at r5 (raw file):
object
should be store the encoding of the string tensor, so after conversion and loaded on the js side, it can be encoded to the correct encoding?
caisq left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 1 approvals obtained (waiting on @caisq and @dsmilkov)
python/tensorflowjs/converters/tf_saved_model_conversion_v2.py, line 197 at r4 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
Because we used to skip adding string weights in the manifest, and now we don't need to skip them anymore since we support them.
Ah - OK. I should have realized that.
dsmilkov left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)
python/tensorflowjs/read_weights.py, line 27 at r5 (raw file):
Previously, pyu10055 (Ping Yu) wrote…
objectshould be store the encoding of the string tensor, so after conversion and loaded on the js side, it can be encoded to the correct encoding?
Good question, but I didn't see any specific need for that since utf-8 can encode any unicode code point. Using a single type of encoding (utf-8) on the serving side simplifies deployment beyond the browser (wechat, react native, any future platforms)
dsmilkov left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL. I changed the string serialization format to align with TF C++. We now store the bytes of the string with 4 bytes denoting the string length.
Reviewable status:
complete! 1 of 1 approvals obtained (waiting on @caisq and @pyu10055)
caisq left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)
python/tensorflowjs/read_weights.py, line 76 at r6 (raw file):
32-bit integer
This is surprising. Why not a uint64? Is it because of JavaScript limitation?
Keep in mind that the format will likely get used as a shared format between Keras and TF.js, so to the degree possible, we should remove JavaScript-specific workarounds from it.
python/tensorflowjs/read_weights.py, line 82 at r6 (raw file):
[byte length of s1][bytes of s1...][byte length of s2][bytes of s2...]
Does bytes of s* ... include the delimiter?
python/tensorflowjs/read_weights.py, line 92 at r6 (raw file):
A np.array o
offset is not documented here.
dsmilkov left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 approvals obtained (waiting on @caisq and @pyu10055)
python/tensorflowjs/read_weights.py, line 76 at r6 (raw file):
Previously, caisq (Shanqing Cai) wrote…
32-bit integerThis is surprising. Why not a uint64? Is it because of JavaScript limitation?
Keep in mind that the format will likely get used as a shared format between Keras and TF.js, so to the degree possible, we should remove JavaScript-specific workarounds from it.
Talked offline. Yes, we can't represent uint64 in JS natively unless we implement our own IEEE-754. Changed to uint32 which is natively supported in JS, and gives us strings with up to 4B characters of length (4GB of ascii text).
python/tensorflowjs/read_weights.py, line 82 at r6 (raw file):
Previously, caisq (Shanqing Cai) wrote…
[byte length of s1][bytes of s1...][byte length of s2][bytes of s2...]Does
bytes of s* ...include the delimiter?
There is no longer a delimiter since the size of the string is explained by the byte length.
python/tensorflowjs/read_weights.py, line 92 at r6 (raw file):
Previously, caisq (Shanqing Cai) wrote…
A np.array o
offsetis not documented here.
Thanks! Added.
dsmilkov left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL, changed to uint32. In tfjs-core, I will document the protocol in the WeightsManifest interface.
Reviewable status: 0 of 1 approvals obtained (waiting on @caisq and @pyu10055)
caisq left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)
python/tensorflowjs/read_weights.py, line 76 at r6 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
Talked offline. Yes, we can't represent
uint64in JS natively unless we implement our ownIEEE-754. Changed touint32which is natively supported in JS, and gives us strings with up to 4B characters of length (4GB of ascii text).
SGTM.
python/tensorflowjs/read_weights.py, line 82 at r6 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
There is no longer a delimiter since the size of the string is explained by the byte length.
OK. Thanks for the explanation.
python/tensorflowjs/write_weights.py, line 24 at r7 (raw file):
from tensorflowjs.read_weights import STRING_LENGTH_DTYPE
As per Google Python style guide, import the module instead, i.e.,
from tensorflowjs import read_weights ... read_weights.STRING_LENGTH_DTYPE ...See https://github.com/google/styleguide/blob/gh-pages/pyguide.md#224-decision
python/tensorflowjs/write_weights_test.py, line 132 at r7 (raw file):
style nit: In-line comments should follow two spaces, not one. Same else where.
pyu10055 left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
Reviewable status:
complete! 2 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)
dsmilkov left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reviews!
Reviewable status:
complete! 2 of 1 approvals obtained (waiting on @caisq and @pyu10055)
python/tensorflowjs/write_weights.py, line 24 at r7 (raw file):
Previously, caisq (Shanqing Cai) wrote…
from tensorflowjs.read_weights import STRING_LENGTH_DTYPEAs per Google Python style guide, import the module instead, i.e.,
from tensorflowjs import read_weights ... read_weights.STRING_LENGTH_DTYPE ...See https://github.com/google/styleguide/blob/gh-pages/pyguide.md#224-decision
Done.
python/tensorflowjs/write_weights_test.py, line 132 at r7 (raw file):
Previously, caisq (Shanqing Cai) wrote…
style nit: In-line comments should follow two spaces, not one. Same else where.
Done.
FEATURE To align with TensorFlow Python/C++, this PR changes the way we serialize strings in the weights format, and in our engine. - We store the underlying encoded string bytes as `Uint8Array`. Thus a string tensors (which has multiple strings) is backed by `Uint8Array[]`. - To keep backwards compatibility, `tensor.data()` returns `string[]`, which means that we try to utf-8 decode a string. - In the weights format, string bytes are kept unchanged, with their original encoding. Each string is prefixed with 4 bytes denoting the number of bytes in the string. Thus, a string tensor of 3 values will be encoded as `[4 bytes][string1...][4 bytes][string2...][4 bytes][string3....]` - Add `util.encodeString(text: string, encoding?: string)` and `util.decodeString(bytes: Uint8Array, encoding?: string)`, along with the respective `Platform` methods - Add `tensor.bytes()` which gives the underlying bytes of the data - `Uint8Array` for any numeric tensor, and `Uint8Array[]` for string tensors. Corresponding change in tfjs-converter: tensorflow/tfjs-converter#386
Allow serialization and deserialization of string weights in the binary weights format.
This is needed to enable execution of AutoML models, which store the vocab as a weight (
Constof dtypestring).Corresponding change in tfjs-core: tensorflow/tfjs-core#1816
Fixes tensorflow/tfjs#1598
This change is