
Conversation

@emrys53 (Contributor) commented Jul 1, 2025

While using XSIMD in our library, we realized that a dedicated reduce_mul function was missing. This PR introduces a reduce_mul implementation for all architectures and supported types, with architecture-specific optimizations for SSE, AVX, and AVX512 instruction sets. Relevant unit tests are included.

I was unable to test the AVX512-specific implementation on my machine, which only supports AVX2. For AVX512, I used the _mm512_reduce_mul_* intrinsics. According to Intel's documentation, these are sequential rather than parallel. Since I couldn't benchmark or compare alternative implementations, I left it as is. I'm open to suggestions for improvement.
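For reference, a minimal sketch of what that AVX512 path amounts to for floats (the intrinsic is from Intel's intrinsics guide; the wrapper name is mine, not the PR's):

```cpp
#include <immintrin.h>

// _mm512_reduce_mul_ps is documented by Intel as a sequence of
// operations (extract, multiply, repeat), not a single instruction.
inline float reduce_mul_f32_avx512(__m512 v) noexcept
{
    return _mm512_reduce_mul_ps(v);
}
```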
Motivation

Although XSIMD provides a general reduce function, using it with std::complex<float> or std::complex<double> fails, as as_unsigned_integer_t is not specialized for complex types. I'm not sure whether this is intended behavior, but I wanted to mention it for context. Providing a reduce_mul directly avoids this issue (see the usage sketch below).
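To illustrate the intended usage (the reduce_mul free function is the API proposed in this PR; the batch width depends on the target architecture):

```cpp
#include <complex>
#include <xsimd/xsimd.hpp>

int main()
{
    // The generic reduce fails to compile for complex batches because
    // as_unsigned_integer_t is not specialized for std::complex;
    // a dedicated reduce_mul avoids that machinery entirely.
    xsimd::batch<std::complex<float>> b(std::complex<float>(2.0f, 0.0f));
    std::complex<float> p = xsimd::reduce_mul(b); // product of all lanes
    return p.real() > 0.f ? 0 : 1;
}
```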

This is my first contribution to XSIMD, and I’ve only been using the library for about a week. Feedback is very welcome, and I’m happy to adjust or improve the code based on your review.


// hmul
template <class A, class T, class /*=typename std::enable_if<std::is_integral<T>::value, void>::type*/>
XSIMD_INLINE T hmul(batch<T, A> const& self, requires_arch<common>) noexcept
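For context, a common-architecture hmul like the one declared above typically spills to a scalar buffer and folds it with multiplication; a minimal sketch in that spirit (my own function name, not the PR's exact code):

```cpp
// Sketch of a common-architecture fallback: store the batch to a
// scalar buffer and multiply element-wise. Portable but slow.
template <class A, class T>
XSIMD_INLINE T hmul_fallback(batch<T, A> const& self) noexcept
{
    alignas(A::alignment()) T buffer[batch<T, A>::size];
    self.store_aligned(buffer);
    T acc = buffer[0];
    for (std::size_t i = 1; i < batch<T, A>::size; ++i)
        acc *= buffer[i];
    return acc;
}
```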
Contributor

Shouldn't this just be the generic implementation of reduce_mul? And if so, why not use the current generic reduction implementation?

Contributor Author

This function is only used for integral types with sizeof(T) < 4. Here is why I implemented it this way: take reduce_mul for uint8_t on an AVX machine as an example. The batch has 32 elements, so the general reduce function multiplies and swizzles the batch 5 times (32->16->8->4->2->1) and at the end calls self.get(0), which is itself quite inefficient (issue #1133). It seemed more efficient to do it differently: for uint8_t, my implementation first splits the batch into 2 equal sse4_2 batches, multiplies them together, and then calls hmul (see the sketch below).
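A rough sketch of that first split-and-multiply step for uint8_t, assuming AVX2 intrinsics and using helper names of my own (in the PR itself, xsimd's split and hmul primitives do this work):

```cpp
#include <immintrin.h>
#include <cstdint>

// No 8-bit multiply in SSE: widen to 16 bits, multiply, repack low bytes.
static inline __m128i mul_u8(__m128i a, __m128i b) noexcept
{
    const __m128i zero = _mm_setzero_si128();
    __m128i lo = _mm_mullo_epi16(_mm_unpacklo_epi8(a, zero),
                                 _mm_unpacklo_epi8(b, zero));
    __m128i hi = _mm_mullo_epi16(_mm_unpackhi_epi8(a, zero),
                                 _mm_unpackhi_epi8(b, zero));
    const __m128i mask = _mm_set1_epi16(0x00FF);
    return _mm_packus_epi16(_mm_and_si128(lo, mask), _mm_and_si128(hi, mask));
}

// One 256->128 split and a lane-wise multiply replace the first of the
// five swizzle rounds; a 128-bit hmul then finishes the reduction.
static inline __m128i split_mul_u8_avx(__m256i v) noexcept
{
    __m128i lo = _mm256_castsi256_si128(v);      // lanes 0..15
    __m128i hi = _mm256_extracti128_si256(v, 1); // lanes 16..31
    return mul_u8(lo, hi);                       // 16 partial products
}
```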

@emrys53 (Contributor Author) Jul 1, 2025

Also, the generic swizzle function for uint8_t just loads the batch into a buffer and stores the result back (roughly the pattern sketched below). Using that 5 times (on AVX) is far less efficient than the split_avx function, which splits the batch with SIMD intrinsics.
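For contrast, the memory round-trip that the generic swizzle amounts to looks roughly like this (a sketch of the pattern, not the actual xsimd source):

```cpp
#include <cstddef>
#include <cstdint>

// Spill the batch, permute through a scalar loop, reload. Doing this
// 5 times per reduction is what makes the generic path slow for uint8_t.
template <std::size_t N>
inline void swizzle_via_buffer(const uint8_t (&in)[N],
                               const uint8_t (&idx)[N],
                               uint8_t (&out)[N]) noexcept
{
    for (std::size_t i = 0; i < N; ++i)
        out[i] = in[idx[i]];
}
```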

Contributor Author

@serge-sans-paille you haven't responded yet. If you are open to the idea of a reduce_mul implementation, I can fix the code so that it passes all the tests. If you have feedback or recommendations on the code I have written, you can post it here and I will address it after we discuss.

@DiamonDinoia (Contributor)

I'm also interested in this. @emrys53 could you rebase onto master?
@serge-sans-paille what do you think needs to be changed to integrate it?

serge-sans-paille added commits that referenced this pull request (Aug 20–23, 2025):
It changes nothing for clang, which has a nice shuffle optimizer, but it does help gcc a lot. Somehow related to #1132
This is a generalization of #1132 by @emrys53. Part of the Intel code is strongly inspired by the work from #1132, with some minor nits.
@serge-sans-paille (Contributor)

@emrys53 I've submitted #1163 as a port of this patch for other targets, as well as some support for other architectures. I was strongly motivated and inspired by your work, so please don't take it as an offense, rather as a tribute!

@JohanMabille (Member)

Closing this in favor of #1163, thanks a lot @emrys53 for initiating the work on this!
