Performance Issue: Inefficient get(0) Implementation
Summary
The current implementation of get() for batch<T, A> always stores the entire batch into an aligned buffer and returns buffer[I], even for I == 0. This introduces unnecessary overhead when only the first element is needed, which is common in reduction operations.
    template <class A, size_t I, class T>
    XSIMD_INLINE T get(batch<T, A> const& self, ::xsimd::index<I>, requires_arch<common>) noexcept
    {
        alignas(A::alignment()) T buffer[batch<T, A>::size];
        self.store_aligned(&buffer[0]);
        return buffer[I];
    }
Problem
Accessing the first element (get(0)) via a full store_aligned is much more expensive than necessary. The reduce function calls self.get(0) at the end, which adds unnecessary cost. If the batch is stored to a buffer anyway, the performance benefit of using reduce disappears, because we could just store everything to a buffer and do the reduction in scalar code. The entire purpose of reduction operations is to avoid copying the data to a buffer.
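To make the difference concrete, here is a minimal sketch of a four-float horizontal sum whose last step reads lane 0 in the two ways discussed above. It is not xsimd's reduce implementation: the function names (fold_to_lane0, hsum_via_store, hsum_via_register) and the SKETCH_HAS_SSE2 macro are made up for illustration, and it falls back to scalar code when SSE2 is not available.

```cpp
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#endif

#ifdef SKETCH_HAS_SSE2
// Hypothetical helper: folds four floats so that lane 0 holds v0 + v1 + v2 + v3.
static inline __m128 fold_to_lane0(__m128 v)
{
    __m128 high = _mm_movehl_ps(v, v);           // lanes: [v2, v3, v2, v3]
    __m128 sums = _mm_add_ps(v, high);           // lanes 0 and 1: v0+v2, v1+v3
    __m128 odd = _mm_shuffle_ps(sums, sums, 1);  // lane 0: v1+v3
    return _mm_add_ss(sums, odd);                // lane 0: v0+v1+v2+v3
}
#endif

// Final step mirrors the generic get<0>: store the whole register, read buffer[0].
float hsum_via_store(const float* p)
{
#ifdef SKETCH_HAS_SSE2
    __m128 r = fold_to_lane0(_mm_loadu_ps(p));
    alignas(16) float buffer[4];
    _mm_store_ps(buffer, r);
    return buffer[0];
#else
    return (p[0] + p[2]) + (p[1] + p[3]);  // scalar fallback, same summation order
#endif
}

// Final step reads lane 0 straight from the register, with no buffer.
float hsum_via_register(const float* p)
{
#ifdef SKETCH_HAS_SSE2
    return _mm_cvtss_f32(fold_to_lane0(_mm_loadu_ps(p)));
#else
    return (p[0] + p[2]) + (p[1] + p[3]);  // scalar fallback, same summation order
#endif
}
```

Both functions return the same value; only the way the final lane is extracted differs.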
Proposed Solution
Introduce a first() helper for efficiently accessing the first lane of a batch:
    template <class T, class A>
    XSIMD_INLINE T first(batch<T, A> const& self) noexcept
    {
        // Example: platform-specific optimized intrinsic
        return self.get_first(); // or use appropriate intrinsic depending on A
    }
This could avoid the store_aligned() call and instead use more efficient intrinsics like:
    _mm_cvtsd_f64() (SSE2)
    _mm256_castps256_ps128() + _mm_cvtss_f32() (AVX)
    _mm512_cvtss_f32() (AVX512)
This would eliminate the cost of storing the entire batch just to access the first element, which should improve performance for reductions and any other first-element access patterns.
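As a rough illustration of the per-type dispatch, the sketch below wraps the 128-bit SSE/SSE2 variants of these intrinsics in plain overloads. The name first_lane and the SKETCH_HAS_SSE2 macro are hypothetical and not part of xsimd. The AVX and AVX512 variants are left out because they need extra compiler flags, and the whole block compiles to nothing on targets without SSE2.

```cpp
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#endif

#ifdef SKETCH_HAS_SSE2
// Hypothetical overload set: each one reads lane 0 directly from the
// register instead of storing all lanes to memory first.

// Four packed floats: lane 0 as float.
inline float first_lane(__m128 v) noexcept
{
    return _mm_cvtss_f32(v);
}

// Two packed doubles: lane 0 as double.
inline double first_lane(__m128d v) noexcept
{
    return _mm_cvtsd_f64(v);
}

// Four packed 32-bit integers: lane 0 as int.
// Assumes the register is being treated as 32-bit lanes; __m128i itself
// does not carry an element type, so other widths need their own helper.
inline int first_lane(__m128i v) noexcept
{
    return _mm_cvtsi128_si32(v);
}
#endif
```

In xsimd itself the dispatch would presumably go through the architecture tag A rather than the raw register type, but the body of each specialization would stay this small.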