8369564: Provide a MemorySegment API to read strings with known lengths #28043

cushon · 2025-10-29T10:50:53Z

This PR proposes adding a new overload to MemorySegment::getString that takes a known byte length of the content.

This was previously proposed in #20725, but the outcome of JDK-8333843 was to update MemorySegment#getString to suggest

 byte[] bytes = new byte[length]; MemorySegment.copy(segment, JAVA_BYTE, offset, bytes, 0, length); return new String(bytes, charset);

However this is less efficient than what the implementation of getString does after JDK-8362893, it now uses JavaLangAccess::uncheckedNewStringNoRepl to avoid the copy.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8369564: Provide a MemorySegment API to read strings with known lengths (Enhancement - P4)

Contributors

Per Minborg <pminborg@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28043/head:pull/28043
$ git checkout pull/28043

Update a local copy of the PR:
$ git checkout pull/28043
$ git pull https://git.openjdk.org/jdk.git pull/28043/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 28043

View PR using the GUI difftool:
$ git pr show -t 28043

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28043.diff

Using Webrev

Link to Webrev Comment

cushon · 2025-10-29T10:51:37Z

/contributor add @minborg

bridgekeeper · 2025-10-29T10:52:25Z

👋 Welcome back cushon! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-10-29T10:53:01Z

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

openjdk · 2025-10-29T10:53:35Z

@cushon
Contributor Per Minborg <pminborg@openjdk.org> successfully added.

openjdk · 2025-10-29T10:54:11Z

@cushon The following label will be automatically applied to this pull request:

core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2025-10-29T10:58:01Z

Webrevs

JornVernee

Some preliminary comments. I didn't look at the tests yet.

JornVernee · 2025-10-29T15:29:34Z

src/java.base/share/classes/jdk/internal/foreign/StringSupport.java

+ case SINGLE_BYTE -> readByte(segment, offset, len, charset);
+ case DOUBLE_BYTE -> readShort(segment, offset, len, charset);
+ case QUAD_BYTE -> readInt(segment, offset, len, charset);


These 3 methods appear to be identical

Thanks, I refactored to do something more similar to the original PR to avoid the duplication here and with the existing read methods.

JornVernee · 2025-10-29T15:30:15Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+ String getString(long offset, int length, Charset charset);
+


I'd suggest putting the length parameter at the end, so that this becomes a telescoping overload of the length-less variant.

JornVernee · 2025-10-29T15:31:45Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+ * <li>{@code N} is the size (in bytes) of the terminator char according
+ * to the provided charset. For instance, this is 1 for


Why is the terminator char important? The segment doesn't necessarily need to have a terminator char, right? I don't see this invariant being checked in the code either.

Thanks, it is not, I think this was left over from javadoc adapted from another overload

JornVernee · 2025-10-29T15:32:40Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+ * <li>{@code B} is the size, in bytes, of the string encoded using the
+ * provided charset (e.g. {@code str.getBytes(charset).length});</li>


Isn't B equal to the length argument?

Thanks, yes, I reworked this part

JornVernee · 2025-10-29T15:35:41Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

+ * @param length byte length to be used for string conversion (not including any
+ * null termination)


I think 'to be used for string conversion' is a bit too vague (used how?). I think a more descriptive text could be something like 'length in bytes of the string to read' (matching also the pattern of the existing 'offset in bytes').

Also, what happens if:

The length does include a null terminator

The length is not a multiple of the byte size of a character in the given charset.

On that last note, I wonder if this shouldn't be the length in bytes, but the length in characters. Then we can compute the byte length from the charset. That will make it impossible to pass a length that is not a multiple of the character size.

Thanks for taking a look, I wanted to respond briefly to this part and will review the rest of the comments later:

I wonder if this shouldn't be the length in bytes, but the length in characters. Then we can compute the byte length from the charset

Part of the motivation here is to support efficiently reading binary formats where I think it's more common to record the length of string data in bytes, than in 16-bit code units in the UTF-16 encoding of the string.

Discussed this with the team as well. For cases of native interop it seems more likely that you'd have e.g. an array of wchar_t on the native side, and you are tracking the length of that array, not the byte size.

A user can easily convert between one or the other length representation by multiplying/dividing by the right scalar, but if the length is specified in bytes, the API has an extra error case we need to check and specify.

Either way, we felt that it would be a good idea if you could send an email to panama-dev in which you describe your exact use case, before getting further into the code review. That would give others a chance to respond with their use cases as well.

A user can easily convert between one or the other length representation by multiplying/dividing by the right scalar

That is true of e.g. UTF-16 but not of UTF-8, since the encoding is variable width and doing the conversion from bytes to characters is more expensive there.

Either way, we felt that it would be a good idea if you could send an email to panama-dev in which you describe your exact use case, before getting further into the code review. That would give others a chance to respond with their use cases as well.

Sounds good, thanks, I can start a thread discussing the use-case here at a higher level.

A user can easily convert between one or the other length representation by multiplying/dividing by the right scalar

That is true of e.g. UTF-16 but not of UTF-8, since the encoding is variable width and doing the conversion from bytes to characters is more expensive there.

Sorry, I don't mean 'character' but 'code unit'. For instance, when reading a UTF-8 string, the unit would be one byte, for UTF-16 it would be two, for UTF-32 four. So a user would just need to divide by the unit size, at least that's the idea.

8369564: Provide a MemorySegment API to read strings with known lengths

937a868

openjdk bot added the core-libs core-libs-dev@openjdk.org label Oct 29, 2025

openjdk bot added the rfr Pull request is ready for review label Oct 29, 2025

JornVernee reviewed Oct 29, 2025

View reviewed changes

cushon added 2 commits October 29, 2025 17:33

Consolidate duplicate code in read methods

cd6db90

Update length to code units instead of bytes

43a719e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

8369564: Provide a MemorySegment API to read strings with known lengths #28043

8369564: Provide a MemorySegment API to read strings with known lengths #28043

cushon commented Oct 29, 2025 •

edited by openjdk bot

Loading

cushon commented Oct 29, 2025

bridgekeeper bot commented Oct 29, 2025

openjdk bot commented Oct 29, 2025

openjdk bot commented Oct 29, 2025

openjdk bot commented Oct 29, 2025 •

edited

Loading

mlbridge bot commented Oct 29, 2025 •

edited

Loading

JornVernee left a comment

JornVernee Oct 29, 2025

cushon Oct 29, 2025

JornVernee Oct 29, 2025

cushon Oct 31, 2025

JornVernee Oct 29, 2025

cushon Oct 31, 2025

JornVernee Oct 29, 2025

cushon Oct 31, 2025

JornVernee Oct 29, 2025

cushon Oct 29, 2025 •

edited

Loading

JornVernee Oct 29, 2025

cushon Oct 29, 2025

JornVernee Oct 29, 2025 •

edited

Loading

Labels

2 participants

		* <li>{@code N} is the size (in bytes) of the terminator char according
		* to the provided charset. For instance, this is 1 for

		* <li>{@code B} is the size, in bytes, of the string encoded using the
		* provided charset (e.g. {@code str.getBytes(charset).length});</li>

		* @param length byte length to be used for string conversion (not including any
		* null termination)

8369564: Provide a MemorySegment API to read strings with known lengths #28043

Are you sure you want to change the base?

8369564: Provide a MemorySegment API to read strings with known lengths #28043

Conversation

cushon commented Oct 29, 2025 • edited by openjdk bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Progress

Issue

Contributors

Reviewing

cushon commented Oct 29, 2025

bridgekeeper bot commented Oct 29, 2025

openjdk bot commented Oct 29, 2025

openjdk bot commented Oct 29, 2025

openjdk bot commented Oct 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

mlbridge bot commented Oct 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Webrevs

JornVernee left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cushon Oct 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JornVernee Oct 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Labels

2 participants

cushon commented Oct 29, 2025 •

edited by openjdk bot

Loading

openjdk bot commented Oct 29, 2025 •

edited

Loading

mlbridge bot commented Oct 29, 2025 •

edited

Loading

cushon Oct 29, 2025 •

edited

Loading

JornVernee Oct 29, 2025 •

edited

Loading