Skip to content

Conversation

@Jefffrey
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

When we have view scalars (utf8/binary) and we call to_array_of_size, the data buffers the resultant arrays have contains duplicate data. This is because the APIs we use don't deduplicate the data, instead appending it each time even though the data is exactly duplicated.

What changes are included in this PR?

Manually use a builder with deduplication enabled.

Are these changes tested?

Added test.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the common Related to common crate label Dec 21, 2025
Comment on lines +3031 to +3037
let mut builder =
StringViewBuilder::with_capacity(size).with_deduplicate_strings();
for _ in 0..size {
builder.append_value(value);
}
let array = builder.finish();
Arc::new(array)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically we could further optimize this by manually calling StringViewArray::try_new with correct data and avoid need to hash as part of the builder; felt that might get too into the weeds of arrow-rs code, so stuck with this simpler approach

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually if we have a benchmark maybe it could be worth testing out if there is a performance gain by doing this? 🤔

let buffers = array.data_buffers();
assert_eq!(1, buffers.len());
// Ensure we only have a single copy of the value string
assert_eq!(value.len(), buffers[0].len());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On main this assert fails as the data buffer would be 10 * value.len(); with this fix we only have a single copy of the full string in the child buffer, minimizing memory footprint of the output array

Arc::new(StringViewArray::from_iter_values(repeat_n(value, size)))
let mut builder =
StringViewBuilder::with_capacity(size).with_deduplicate_strings();
for _ in 0..size {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 we should prob have some kind of append_n to remove this boilerplate, in some future for arrow-rs

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Jefffrey this is LGTM

Copy link
Contributor

@2010YOUY01 2010YOUY01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good idea to keep it simple now. I agree if we can add a new API in arrow-rs to let it directly construct StringView arrays with repeating element of size k, it's likely to be much faster. (should we open an issue?)

@Jefffrey
Copy link
Contributor Author

An append_n API sounds good, raised apache/arrow-rs#9034

@Jefffrey Jefffrey added this pull request to the merge queue Dec 23, 2025
Merged via the queue into apache:main with commit 677c543 Dec 23, 2025
27 checks passed
@Jefffrey Jefffrey deleted the optimize-view-to-arr branch December 23, 2025 08:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate

4 participants