Skip to content

Conversation

@laurit
Copy link
Contributor

@laurit laurit commented May 6, 2024

To marshal strings without allocating extra memory we need to implement our own methods for estimating the byte size of utf-8 encoded string and encoding the string to utf8. Unfortunately jit compiler isn't able to optimize these too well. This PR explores using unsafe to speed up utf8 encoding. The focus is on strings that use only latin1 and are stored inside a byte array in the java.lang.String class since jdk9. Unsafe is used to access that byte array. Based on the byte array we determine whether the string is composed only of 7-bit characters, if so we can copy they byte array directly to output as no additional encoding is needed. To determine whether the string only has 7-bit characters we count bytes that have negative value, the count of such bytes corresponds to characters that would be encoded as 2 bytes in utf8. Unsafe is used to let this loop process 8 bytes at a time.

Benchmark (stringSize) Mode Cnt Score Error Units StringMarshalBenchmark.marshalAsciiStringStateful 512 avgt 10 0.032 ± 0.001 us/op StringMarshalBenchmark.marshalAsciiStringStateful:gc.alloc.rate 512 avgt 10 17387.600 ± 362.188 MB/sec StringMarshalBenchmark.marshalAsciiStringStateful:gc.alloc.rate.norm 512 avgt 10 576.000 ± 0.001 B/op StringMarshalBenchmark.marshalAsciiStringStateful:gc.count 512 avgt 10 270.000 counts StringMarshalBenchmark.marshalAsciiStringStateful:gc.time 512 avgt 10 99.000 ms StringMarshalBenchmark.marshalAsciiStringStatelessSafe 512 avgt 10 0.363 ± 0.008 us/op StringMarshalBenchmark.marshalAsciiStringStatelessSafe:gc.alloc.rate 512 avgt 10 0.006 ± 0.001 MB/sec StringMarshalBenchmark.marshalAsciiStringStatelessSafe:gc.alloc.rate.norm 512 avgt 10 0.002 ± 0.001 B/op StringMarshalBenchmark.marshalAsciiStringStatelessSafe:gc.count 512 avgt 10 ≈ 0 counts StringMarshalBenchmark.marshalAsciiStringStatelessUnsafe 512 avgt 10 0.038 ± 0.001 us/op StringMarshalBenchmark.marshalAsciiStringStatelessUnsafe:gc.alloc.rate 512 avgt 10 596.035 ± 7.136 MB/sec StringMarshalBenchmark.marshalAsciiStringStatelessUnsafe:gc.alloc.rate.norm 512 avgt 10 24.000 ± 0.001 B/op StringMarshalBenchmark.marshalAsciiStringStatelessUnsafe:gc.count 512 avgt 10 18.000 counts StringMarshalBenchmark.marshalAsciiStringStatelessUnsafe:gc.time 512 avgt 10 10.000 ms StringMarshalBenchmark.marshalLatin1StringStateful 512 avgt 10 0.214 ± 0.001 us/op StringMarshalBenchmark.marshalLatin1StringStateful:gc.alloc.rate 512 avgt 10 4837.182 ± 14.924 MB/sec StringMarshalBenchmark.marshalLatin1StringStateful:gc.alloc.rate.norm 512 avgt 10 1088.001 ± 0.001 B/op StringMarshalBenchmark.marshalLatin1StringStateful:gc.count 512 avgt 10 145.000 counts StringMarshalBenchmark.marshalLatin1StringStateful:gc.time 512 avgt 10 47.000 ms StringMarshalBenchmark.marshalLatin1StringStatelessSafe 512 avgt 10 0.817 ± 0.003 us/op StringMarshalBenchmark.marshalLatin1StringStatelessSafe:gc.alloc.rate 512 avgt 10 0.006 ± 0.001 MB/sec StringMarshalBenchmark.marshalLatin1StringStatelessSafe:gc.alloc.rate.norm 512 avgt 10 0.005 ± 0.001 B/op StringMarshalBenchmark.marshalLatin1StringStatelessSafe:gc.count 512 avgt 10 ≈ 0 counts StringMarshalBenchmark.marshalLatin1StringStatelessUnsafe 512 avgt 10 0.555 ± 0.002 us/op StringMarshalBenchmark.marshalLatin1StringStatelessUnsafe:gc.alloc.rate 512 avgt 10 0.006 ± 0.001 MB/sec StringMarshalBenchmark.marshalLatin1StringStatelessUnsafe:gc.alloc.rate.norm 512 avgt 10 0.003 ± 0.001 B/op StringMarshalBenchmark.marshalLatin1StringStatelessUnsafe:gc.count 512 avgt 10 ≈ 0 counts StringMarshalBenchmark.marshalUnicodeStringStateful 512 avgt 10 0.502 ± 0.004 us/op StringMarshalBenchmark.marshalUnicodeStringStateful:gc.alloc.rate 512 avgt 10 3037.872 ± 25.989 MB/sec StringMarshalBenchmark.marshalUnicodeStringStateful:gc.alloc.rate.norm 512 avgt 10 1600.003 ± 0.001 B/op StringMarshalBenchmark.marshalUnicodeStringStateful:gc.count 512 avgt 10 91.000 counts StringMarshalBenchmark.marshalUnicodeStringStateful:gc.time 512 avgt 10 33.000 ms StringMarshalBenchmark.marshalUnicodeStringStatelessSafe 512 avgt 10 0.784 ± 0.021 us/op StringMarshalBenchmark.marshalUnicodeStringStatelessSafe:gc.alloc.rate 512 avgt 10 0.006 ± 0.001 MB/sec StringMarshalBenchmark.marshalUnicodeStringStatelessSafe:gc.alloc.rate.norm 512 avgt 10 0.005 ± 0.001 B/op StringMarshalBenchmark.marshalUnicodeStringStatelessSafe:gc.count 512 avgt 10 ≈ 0 counts StringMarshalBenchmark.marshalUnicodeStringStatelessUnsafe 512 avgt 10 0.770 ± 0.029 us/op StringMarshalBenchmark.marshalUnicodeStringStatelessUnsafe:gc.alloc.rate 512 avgt 10 0.006 ± 0.001 MB/sec StringMarshalBenchmark.marshalUnicodeStringStatelessUnsafe:gc.alloc.rate.norm 512 avgt 10 0.005 ± 0.001 B/op StringMarshalBenchmark.marshalUnicodeStringStatelessUnsafe:gc.count 512 avgt 10 ≈ 0 counts 
@laurit laurit requested a review from a team May 6, 2024 08:19
@codecov
Copy link

codecov bot commented May 6, 2024

Codecov Report

Attention: Patch coverage is 77.27273% with 15 lines in your changes are missing coverage. Please review.

Project coverage is 90.83%. Comparing base (a764419) to head (eb1cc64).
Report is 9 commits behind head on main.

Current head eb1cc64 differs from pull request most recent head 43c82d6

Please upload reports for the commit 43c82d6 to get more accurate results.

Files Patch % Lines
...emetry/exporter/internal/marshal/UnsafeString.java 56.25% 4 Missing and 3 partials ⚠️
...emetry/exporter/internal/marshal/UnsafeAccess.java 72.22% 4 Missing and 1 partial ⚠️
...orter/internal/marshal/StatelessMarshalerUtil.java 89.28% 0 Missing and 3 partials ⚠️
Additional details and impacted files
@@ Coverage Diff @@ ## main #6433 +/- ## ============================================ - Coverage 90.87% 90.83% -0.05%  - Complexity 6169 6197 +28  ============================================ Files 678 680 +2 Lines 18507 18568 +61 Branches 1818 1829 +11 ============================================ + Hits 16819 16866 +47  - Misses 1153 1160 +7  - Partials 535 542 +7 

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jkwatson
Copy link
Contributor

jkwatson commented May 7, 2024

What will happen with java 22, when Unsafe is deprecated?

@laurit
Copy link
Contributor Author

laurit commented May 8, 2024

What will happen with java 22, when Unsafe is deprecated?

Looking at https://openjdk.org/jeps/471 I'd say nothing will change before jdk 25. As unsafe usage is optional even then not much will change. Eventually users may need to opt in by allowing access to private members of java.lang.String so we could read the underlying byte array. Java agent can probably get access without requiring users to do anything. In coming jdk releases using vector api will allow us to further improve the performance.

@jkwatson
Copy link
Contributor

jkwatson commented May 8, 2024

Eventually users may need to opt in by allowing access to private members of java.lang.String so we could read the underlying byte array.

We definitely would not want to force people to do any additional work to use the SDK out of the box. For that reason, I'm a bit nervous about making this optimization.

@trask
Copy link
Member

trask commented May 8, 2024

We definitely would not want to force people to do any additional work to use the SDK out of the box. For that reason, I'm a bit nervous about making this optimization.

I believe the optimization only takes effect if Unsafe is available, @laurit can you clarify?

@laurit
Copy link
Contributor Author

laurit commented May 9, 2024

Yes the intent is to use this optimization only when unsafe is available and has the required methods, otherwise just fall back to the version that doesn't require unsafe. If we need to further future proof it we could consider limiting it by jvm version. If 21 is used we'd allow the optimization, but disallow it for future versions like 23 for which we don't know what is going to happen. When 23 is release we could verify that it works and enable the optimization for that version too. @jkwatson would you be more comfortable with something like this?

We definitely would not want to force people to do any additional work to use the SDK out of the box. For that reason, I'm a bit nervous about making this optimization.

Given the way jdk authors have been trying to lock down access to jdk internals since jdk 9 at one point user code won't be able to access private fields in jdk classes unless user explicitly allows it with --add-opens=java.base/java.lang=ALL-UNNAMED. By the time this happens perhaps there are new apis or ideas or advances in the jit compiler that make using unsafe unnecessary.

@jhalliday
Copy link
Contributor

Interesting idea! I'll skip the obvious (Unsafe is unsafe, and has a limited shelf life at this point) and get right into the gritty details:

If I've understood it right, there are at least three distinct things here, which are conflated in the presentation of the benchmark data and which may be interesting to consider separately:

  • Stateful (the Marshaler is a one-shot object wrapping the thing to be marshalled, which is how all the others work?) vs Stateless (more c-style, the thing to be marshalled is passed as an argument). That seems an invasive change, in that any user of the Marshalers has to handle the stateless one differently. Presumably it's aimed at cutting object allocation, but that's distinct from just using Unsafe. Indeed, it's making life harder - the safe/unsafe switch could otherwise be an encapsulated implementation detail of the marshaler. Usability concerns aside, it's also I think the likely culprit causing perf regression in the Safe case, in that size now has to be recalculated instead of cached in marshaler instance. That seems especially painful for the common ascii case - 0.032 us/op growing to 0.363 is hard to ignore, even if it may be offset by gc savings.

  • The test string length has been changed from 16 to 512. Are either of those values based on observed data in production? If not, parametrising the benchmark over a range of values would be interesting. In particular, I'd like to see the numbers for the original 16 length.

  • The batching of bytes into a long feels overly complex. When doing compact strings in the JDK they went with a simple byte at a time loop, presumably relying on the compiler to unroll it and the CPU to pipeline it. Whilst I'm not keen on a combinatorial explosion of benchmark parameters, I'd be more comfortable with the extra code complexity here if it was backed by some evidence it's actually worthwhile.

  • StatelessUnsafe appears to be allocating in the acsii case, but not for latin1 or unicode. That feels somewhat suspicious. What am I missing/

@laurit
Copy link
Contributor Author

laurit commented May 16, 2024

That seems an invasive change, in that any user of the Marshalers has to handle the stateless one differently.

These changes are already done and merged. I think they are out of the scope of this PR.

The test string length has been changed from 16 to 512. Are either of those values based on observed data in production?

They aren't, they were copied from the existing span marshaling benchmark that creates test data with either 16 or 512 spans. This benchmark was added in a previous PR that introduced stateless marshalers. I think I removed 16 just to make the benchmark run faster.

The batching of bytes into a long feels overly complex. When doing compact strings in the JDK they went with a simple byte at a time loop, presumably relying on the compiler to unroll it and the CPU to pipeline it.

If you look at the jdk code https://github.com/openjdk/jdk17u/blob/68caeca1a5f265ca30a7e6d2a62127d458a3473e/src/java.base/share/classes/java/lang/StringCoding.java#L37 you'll see that the method with the simple loop is using @IntrinsicCandidate annotation the actual implementation https://github.com/openjdk/jdk17u/blob/68caeca1a5f265ca30a7e6d2a62127d458a3473e/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp#L3462 is also quite complicated. Unfortunately jit compiler doesn't generate the fastest code for the simple loop. Hotspot jit vectorizes loops where you add 2 vectors, but as far as I know it does not vectorize loops that have a reduction (for example compute dot product). In future jdk version vector api could allow faster implementation for this method that does not use unsafe.

StatelessUnsafe appears to be allocating in the acsii case, but not for latin1 or unicode. That feels somewhat suspicious.

The 24 bytes should come from writeBinaryTo method that allocates a new ProtoSerializer. My guess is that JIT compiler manages to stack allocate ProtoSerializer. JIT compiler works in mysterious ways.

@jack-berg
Copy link
Member

That seems especially painful for the common ascii case - 0.032 us/op growing to 0.363 is hard to ignore, even if it may be offset by gc savings.

My understanding is that there are two dimensions. 1. Whether or not the stateless marshaling is enabled, which is triggered by setting OTEL_JAVA_EXPERIMENTAL_EXPORTER_MEMORY_MODE=reusable_data. 2. Whether or not unsafe is available. These two dimensions result in the following perf figures for the ascii case:

Stateful Stateless
Safe 0.032 us/op
576 B/op
0.363 us/op
.002 B/op
Unsafe n/a 0.038 us/op
24 B/op

So you only get the 0.032 us/op case if you opt in reusable_data memory mode and unsafe is not available, which seems like it should be an unusual combination. Although it will become more common when we make reusable_data mode the default.

@laurit any indea why the ascii stateless/safe scenario seems to be so much worse that the latin1 or unicode equivalent?

Copy link
Member

@jack-berg jack-berg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of comments but I'm in favor of this. I get that unsafe is unsafe, but:

  • We already have precedent for using it with BatchSpanProcessor
  • This is special case where we can benefit quite a bit from its usage
  • The usage is limited to a small area of code which is well defined / contained, with fallback logic when unsafe is not available
}

/** Returns the count of bytes with negative value. */
private static int countNegative(byte[] bytes) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took me a bit to parse what's going on here but I think I get it:

  • We want to use unsafe to read the value of a string's internal byte[] directly when a string is composed of all latin1 characters that fit within 7 bits (first 128 code points)
  • In order to determine if a string meets this condition as fast as possible, we use unsafe and some clever (seriously clever - where did you come up with this??) shifting to read through the string in chunks of 8 bytes, counting the number of times when a byte has a 1 in the most significant bit and therefore doesn't fit in 7 bits. Reading 8 bytes at a time is faster than reading 1 byte at a time, which is what we do currently when trying to compute the size of a string with reusable_data memory mode.
  • Later, when we are serializing the string, we check if the string is latin1 and had a size which was the same as the string length. If true, we can use unsafe to write the string's internal byte[] to the output stream. If false, then fallback to the existing serialization logic.

Does this sound like a correct characterization @laurit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this sound like a correct characterization @laurit?

yes

We want to use unsafe to read the value of a string's internal byte[] directly when a string is composed of all latin1 characters that fit within 7 bits (first 128 code points)

when we read they byte array we don't know whether it is going to be 7bit or not

where did you come up with this?

Typically such loops are sped up by using vector instructions that allow processing multiple bytes simultaneously. Processing bytes on long a time is similar to using vector instructions. It would be interesting to see what effect there would be when this method is written using the vector api preview feature in recent jdks.

If true, we can use unsafe to write the string's internal byte[] to the output stream

That part doesn't use unsafe, just plain System.arraycopy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically such loops are sped up by using vector instructions that allow processing multiple bytes simultaneously. Processing bytes on long a time is similar to using vector instructions. It would be interesting to see what effect there would be when this method is written using the vector api preview feature in recent jdks.

Fascinating!

api(project(":sdk-extensions:autoconfigure-spi"))

compileOnly(project(":sdk:common"))
compileOnly(project(":exporters:common:compile-stub"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if you can explain why a separate module is needed for this, even tho its not published.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I try to compile a class that uses Unsafe with javac from jdk 17 targeting jdk 8 it fails with

javac --release 8 Test.java Test.java:3: error: package sun.misc does not exist	System.err.println(sun.misc.Unsafe.class.getName()); 

curiously when targeting jdk 11 it gets a warning instead

javac --release 11 Test.java Test.java:3: warning: Unsafe is internal proprietary API and may be removed in a future release	System.err.println(sun.misc.Unsafe.class.getName()); 

Not using --release 8 also results in a warning

javac --source 8 --target 8 Test.java warning: [options] bootstrap class path not set in conjunction with -source 8 Test.java:3: warning: Unsafe is internal proprietary API and may be removed in a future release	System.err.println(sun.misc.Unsafe.class.getName()); 

Creating a copy of Unsafe just for compiling felt like the easiest way to make this work. There is a comment in our version of Unsafe

/** * sun.misc.Unsafe from the JDK isn't found by the compiler, we provide out own trimmed down version * that we can compile against. */ 

I could add a similar comment here also. Is this what you were looking for?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that would be helpful.

@jhalliday
Copy link
Contributor

reusable_data memory mode and unsafe is not available... will become more common when we make reusable_data mode the default.

yes, my concern is the future point where unsafe is retired or discouraged in the JDK and reusable_data is normal. The gc saving is nice if gc is a problem e.g. single core containers, but where gc is low cost e.g. enough spare cores to run it in the background, then the added CPU cost of the stateless serialization on the critical path by default may be unwelcome. As long as stateful is still available as an option it's not a big deal, the default can be tweaked fairly easily.

@jhalliday
Copy link
Contributor

StatelessUnsafe appears to be allocating in the acsii case, but not for latin1 or unicode. That feels somewhat suspicious.

The 24 bytes should come from writeBinaryTo method that allocates a new ProtoSerializer. My guess is that JIT compiler manages to stack allocate ProtoSerializer. JIT compiler works in mysterious ways.

perhaps, but I'm not clear what's allowing it to elide the allocation in some cases but not others. If the code was polymorphic on the ascii vs latin1/unicode choice then it could be optimizing the two functions differently, but I don't see where it is. Log the compiler execution trace perhaps? It's not bad, it's just annoyingly inexplicable.

@laurit
Copy link
Contributor Author

laurit commented May 24, 2024

@laurit any idea why the ascii stateless/safe scenario seems to be so much worse that the latin1 or unicode equivalent?

My understand is that this is because JDK optimizes for the ascii case. See https://github.com/openjdk/jdk17/blob/4afbcaf55383ec2f5da53282a1547bac3d099e9d/src/java.base/share/classes/java/lang/String.java#L1264

 if (!StringCoding.hasNegatives(val, 0, val.length)) return Arrays.copyOf(val, val.length);

The StringCoding.hasNegatives is annotated with @IntrinsicCandidate and gets replaced with https://github.com/openjdk/jdk17u/blob/68caeca1a5f265ca30a7e6d2a62127d458a3473e/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp#L3462

Copy link
Member

@jack-berg jack-berg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last comment about providing a property for explicit opt out. Seems like a conservative way to protect against any user concerns or issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

5 participants