Skip to content

Conversation

@cushon
Copy link
Contributor

@cushon cushon commented Oct 29, 2025

This PR proposes adding a new overload to MemorySegment::getString that takes a known byte length of the content.

This was previously proposed in #20725, but the outcome of JDK-8333843 was to update MemorySegment#getString to suggest

 byte[] bytes = new byte[length]; MemorySegment.copy(segment, JAVA_BYTE, offset, bytes, 0, length); return new String(bytes, charset); 

However this is less efficient than what the implementation of getString does after JDK-8362893, it now uses JavaLangAccess::uncheckedNewStringNoRepl to avoid the copy.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8369564: Provide a MemorySegment API to read strings with known lengths (Enhancement - P4)

Contributors

  • Per Minborg <pminborg@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28043/head:pull/28043
$ git checkout pull/28043

Update a local copy of the PR:
$ git checkout pull/28043
$ git pull https://git.openjdk.org/jdk.git pull/28043/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 28043

View PR using the GUI difftool:
$ git pr show -t 28043

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28043.diff

Using Webrev

Link to Webrev Comment

@cushon
Copy link
Contributor Author

cushon commented Oct 29, 2025

/contributor add @minborg

@bridgekeeper
Copy link

bridgekeeper bot commented Oct 29, 2025

👋 Welcome back cushon! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Oct 29, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk
Copy link

openjdk bot commented Oct 29, 2025

@cushon
Contributor Per Minborg <pminborg@openjdk.org> successfully added.

@openjdk openjdk bot added the core-libs core-libs-dev@openjdk.org label Oct 29, 2025
@openjdk
Copy link

openjdk bot commented Oct 29, 2025

@cushon The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the rfr Pull request is ready for review label Oct 29, 2025
@mlbridge
Copy link

mlbridge bot commented Oct 29, 2025

Webrevs

Copy link
Member

@JornVernee JornVernee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some preliminary comments. I didn't look at the tests yet.

Comment on lines 64 to 66
case SINGLE_BYTE -> readByte(segment, offset, len, charset);
case DOUBLE_BYTE -> readShort(segment, offset, len, charset);
case QUAD_BYTE -> readInt(segment, offset, len, charset);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 3 methods appear to be identical

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I refactored to do something more similar to the original PR to avoid the duplication here and with the existing read methods.

Comment on lines 1362 to 1363
String getString(long offset, int length, Charset charset);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest putting the length parameter at the end, so that this becomes a telescoping overload of the length-less variant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 1350 to 1351
* <li>{@code N} is the size (in bytes) of the terminator char according
* to the provided charset. For instance, this is 1 for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the terminator char important? The segment doesn't necessarily need to have a terminator char, right? I don't see this invariant being checked in the code either.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it is not, I think this was left over from javadoc adapted from another overload

Comment on lines 1348 to 1349
* <li>{@code B} is the size, in bytes, of the string encoded using the
* provided charset (e.g. {@code str.getBytes(charset).length});</li>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't B equal to the length argument?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, yes, I reworked this part

Comment on lines 1337 to 1338
* @param length byte length to be used for string conversion (not including any
* null termination)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 'to be used for string conversion' is a bit too vague (used how?). I think a more descriptive text could be something like 'length in bytes of the string to read' (matching also the pattern of the existing 'offset in bytes').

Also, what happens if:

  • The length does include a null terminator
  • The length is not a multiple of the byte size of a character in the given charset.

On that last note, I wonder if this shouldn't be the length in bytes, but the length in characters. Then we can compute the byte length from the charset. That will make it impossible to pass a length that is not a multiple of the character size.

Copy link
Contributor Author

@cushon cushon Oct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look, I wanted to respond briefly to this part and will review the rest of the comments later:

I wonder if this shouldn't be the length in bytes, but the length in characters. Then we can compute the byte length from the charset

Part of the motivation here is to support efficiently reading binary formats where I think it's more common to record the length of string data in bytes, than in 16-bit code units in the UTF-16 encoding of the string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed this with the team as well. For cases of native interop it seems more likely that you'd have e.g. an array of wchar_t on the native side, and you are tracking the length of that array, not the byte size.

A user can easily convert between one or the other length representation by multiplying/dividing by the right scalar, but if the length is specified in bytes, the API has an extra error case we need to check and specify.

Either way, we felt that it would be a good idea if you could send an email to panama-dev in which you describe your exact use case, before getting further into the code review. That would give others a chance to respond with their use cases as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A user can easily convert between one or the other length representation by multiplying/dividing by the right scalar

That is true of e.g. UTF-16 but not of UTF-8, since the encoding is variable width and doing the conversion from bytes to characters is more expensive there.

Either way, we felt that it would be a good idea if you could send an email to panama-dev in which you describe your exact use case, before getting further into the code review. That would give others a chance to respond with their use cases as well.

Sounds good, thanks, I can start a thread discussing the use-case here at a higher level.

Copy link
Member

@JornVernee JornVernee Oct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A user can easily convert between one or the other length representation by multiplying/dividing by the right scalar

That is true of e.g. UTF-16 but not of UTF-8, since the encoding is variable width and doing the conversion from bytes to characters is more expensive there.

Sorry, I don't mean 'character' but 'code unit'. For instance, when reading a UTF-8 string, the unit would be one byte, for UTF-16 it would be two, for UTF-32 four. So a user would just need to divide by the unit size, at least that's the idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core-libs core-libs-dev@openjdk.org rfr Pull request is ready for review

2 participants