- Notifications
You must be signed in to change notification settings - Fork 6.1k
8369564: Provide a MemorySegment API to read strings with known lengths #28043
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| /contributor add @minborg |
| 👋 Welcome back cushon! A progress list of the required criteria for merging this PR into |
| ❗ This change is not yet ready to be integrated. |
| @cushon |
Webrevs
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some preliminary comments. I didn't look at the tests yet.
| case SINGLE_BYTE -> readByte(segment, offset, len, charset); | ||
| case DOUBLE_BYTE -> readShort(segment, offset, len, charset); | ||
| case QUAD_BYTE -> readInt(segment, offset, len, charset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These 3 methods appear to be identical
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I refactored to do something more similar to the original PR to avoid the duplication here and with the existing read methods.
| String getString(long offset, int length, Charset charset); | ||
| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest putting the length parameter at the end, so that this becomes a telescoping overload of the length-less variant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| * <li>{@code N} is the size (in bytes) of the terminator char according | ||
| * to the provided charset. For instance, this is 1 for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the terminator char important? The segment doesn't necessarily need to have a terminator char, right? I don't see this invariant being checked in the code either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, it is not, I think this was left over from javadoc adapted from another overload
| * <li>{@code B} is the size, in bytes, of the string encoded using the | ||
| * provided charset (e.g. {@code str.getBytes(charset).length});</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't B equal to the length argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, yes, I reworked this part
| * @param length byte length to be used for string conversion (not including any | ||
| * null termination) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think 'to be used for string conversion' is a bit too vague (used how?). I think a more descriptive text could be something like 'length in bytes of the string to read' (matching also the pattern of the existing 'offset in bytes').
Also, what happens if:
- The length does include a null terminator
- The length is not a multiple of the byte size of a character in the given charset.
On that last note, I wonder if this shouldn't be the length in bytes, but the length in characters. Then we can compute the byte length from the charset. That will make it impossible to pass a length that is not a multiple of the character size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look, I wanted to respond briefly to this part and will review the rest of the comments later:
I wonder if this shouldn't be the length in bytes, but the length in characters. Then we can compute the byte length from the charset
Part of the motivation here is to support efficiently reading binary formats where I think it's more common to record the length of string data in bytes, than in 16-bit code units in the UTF-16 encoding of the string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed this with the team as well. For cases of native interop it seems more likely that you'd have e.g. an array of wchar_t on the native side, and you are tracking the length of that array, not the byte size.
A user can easily convert between one or the other length representation by multiplying/dividing by the right scalar, but if the length is specified in bytes, the API has an extra error case we need to check and specify.
Either way, we felt that it would be a good idea if you could send an email to panama-dev in which you describe your exact use case, before getting further into the code review. That would give others a chance to respond with their use cases as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A user can easily convert between one or the other length representation by multiplying/dividing by the right scalar
That is true of e.g. UTF-16 but not of UTF-8, since the encoding is variable width and doing the conversion from bytes to characters is more expensive there.
Either way, we felt that it would be a good idea if you could send an email to panama-dev in which you describe your exact use case, before getting further into the code review. That would give others a chance to respond with their use cases as well.
Sounds good, thanks, I can start a thread discussing the use-case here at a higher level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A user can easily convert between one or the other length representation by multiplying/dividing by the right scalar
That is true of e.g. UTF-16 but not of UTF-8, since the encoding is variable width and doing the conversion from bytes to characters is more expensive there.
Sorry, I don't mean 'character' but 'code unit'. For instance, when reading a UTF-8 string, the unit would be one byte, for UTF-16 it would be two, for UTF-32 four. So a user would just need to divide by the unit size, at least that's the idea.
This PR proposes adding a new overload to
MemorySegment::getStringthat takes a known byte length of the content.This was previously proposed in #20725, but the outcome of JDK-8333843 was to update
MemorySegment#getStringto suggestHowever this is less efficient than what the implementation of getString does after JDK-8362893, it now uses
JavaLangAccess::uncheckedNewStringNoReplto avoid the copy.Progress
Issue
Contributors
<pminborg@openjdk.org>Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28043/head:pull/28043$ git checkout pull/28043Update a local copy of the PR:
$ git checkout pull/28043$ git pull https://git.openjdk.org/jdk.git pull/28043/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 28043View PR using the GUI difftool:
$ git pr show -t 28043Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28043.diff
Using Webrev
Link to Webrev Comment