feat: Add support for BinaryView and Utf8View types #320

GeorgeLeePatterson · 2025-10-31T20:14:29Z

What's Changed

This PR adds read support for BinaryView and Utf8View types (Arrow format 1.4.0+), enabling arrow-js to consume IPC data from systems like InfluxDB 3.0 and DataFusion that use view types for efficient string handling.

Implementation Details

Core Type Support

Added BinaryView and Utf8View type classes with view struct layout constants
Type enum entries: Type.BinaryView = 23, Type.Utf8View = 24
Data class support for variadic buffer management

Visitor Pattern

Get visitor: Implements proper view semantics (16-byte structs, inline/out-of-line data)
Set visitor: Marks as immutable (read-only)
VectorLoader: Reads from IPC format with variadicBufferCounts
TypeComparator, TypeCtor: Type system integration
JSON visitors
Builders

FlatBuffers

Generated schema files for BinaryView, Utf8View
Introduced scripts/update_flatbuffers.sh to regenerate from Arrow format definitions

What Works

Reading BinaryView/Utf8View columns from Arrow IPC as well as JSON
Accessing values with proper inline/out-of-line handling
Variadic buffer management
Type checking and comparison
BinaryView and Utf8View Builders

Testing

Unit tests for BinaryView and Utf8View
Tests verify both inline (≤12 bytes) and out-of-line data handling
TypeScript compiles without errors
All existing tests pass
Builders verified
Verified against DataFusion 50.0.3 integration, not included in this PR (enables native view types, removing need for configuration change in DataFusion's SessionConfig)

Future Work (Separate PRs)

~~Builders for write operations~~
ListView/LargeListView type implementation
~~Additional test coverage~~

Closes #311
Related to #225

This PR adds read support for BinaryView and Utf8View types (Arrow format 1.4.0+), enabling arrow-js to consume IPC data from systems like InfluxDB 3.0 and DataFusion that use view types for efficient string handling. - Added BinaryView and Utf8View type classes with view struct layout constants - Type enum entries: Type.BinaryView = 23, Type.Utf8View = 24 - Data class support for variadic buffer management - Get visitor: Implements proper view semantics (16-byte structs, inline/out-of-line data) - Set visitor: Marks as immutable (read-only) - VectorLoader: Reads from IPC format with variadicBufferCounts - TypeComparator, TypeCtor: Type system integration - JSON visitors: Explicitly unsupported (throws error) - Generated schema files for BinaryView, Utf8View, ListView, LargeListView - Script to regenerate from Arrow format definitions - Reading BinaryView/Utf8View columns from Arrow IPC files - Accessing values with proper inline/out-of-line handling - Variadic buffer management - Type checking and comparison - ✅ Unit tests for BinaryView and Utf8View (test/unit/ipc/view-types-tests.ts) - ✅ Tests verify both inline (≤12 bytes) and out-of-line data handling - ✅ TypeScript compiles without errors - ✅ All existing tests pass - ✅ Verified with DataFusion 50.0.3 integration (enables native view types, removing need for workarounds) - Reading query results from DataFusion 50.0+ with view types enabled - Consuming InfluxDB 3.0 Arrow data with Utf8View/BinaryView columns - Processing Arrow IPC streams from any system using view types - Builders for write operations - ListView/LargeListView type implementation - Additional test coverage Closes apache#311 Related to apache#225

src/vector.ts

GeorgeLeePatterson · 2025-10-31T20:42:15Z

@kylebarron Here is the first PR in a series I will post. Let me know if you have any feedback and I can address it as soon as I am able.

scripts/update_flatbuffers.sh

… from test tsconfig

Add scripts/update_flatbuffers.sh and test/unit/ipc/view-types-tests.ts to RAT (Release Audit Tool) exclusion list. Both files have proper Apache license headers but need to be excluded from license scanning.

This reverts commit dfe9d56.

Remove blank line after shebang to match Apache Arrow JS convention. License header must start on line 2 with '#' as shown in ci/scripts/build.sh

Add BinaryView and Utf8View to main exports in Arrow.ts. These types were implemented but not exported, causing 'BinaryView is not a constructor' errors in ES5 UMD tests.

Add BinaryView and Utf8View to Arrow.dom.ts exports. Arrow.node.ts re-exports from Arrow.dom.ts, so this fixes both entrypoints.

src/visitor/get.ts

src/visitor/vectorloader.ts

src/data.ts

src/fb/SparseTensor.ts

trxcllnt

Made a few minor suggestions, but overall looks good.

I haven't kept up with the format additions the last few years, so I wasn't aware of the BinaryView/Utf8View memory layout -- it's clever, I like it. Thanks for contributing such a thoughtfully crafted implementation!

GeorgeLeePatterson · 2025-11-02T13:20:25Z

@trxcllnt Great news. The suggestions make perfect sense and are straight forward to implement. I imagine I made similar edits to the other 3 PRs, so I'll see where I can extend these changes into those PRs as well.

I'll have the changes up shortly.

- Simplify variadicBuffers byteLength calculation with reduce - Remove unsupported type enum entries (only add BinaryView and Utf8View) - Eliminate type casting by extracting getBinaryViewBytes helper - Simplify readVariadicBuffers with Array.from - Remove CompressedVectorLoader override (inherits base implementation) - Delete SparseTensor.ts (not implementing tensors in this PR)

GeorgeLeePatterson · 2025-11-04T15:40:13Z

I've identified some gaps on the Json side of the equation. I will work through them and post back when they are all resolved.

This commit implements complete JSON integration test support for BinaryView and Utf8View types by adding handling for variadic data buffers. Changes: - Updated buffersFromJSON() to handle VIEWS and VARIADIC_DATA_BUFFERS fields - Added variadicBufferCountsFromJSON() using reduce pattern to extract counts - Updated recordBatchFromJSON() to pass variadicBufferCounts to RecordBatch - Updated JSONVectorLoader constructor to accept and pass variadicBufferCounts - Updated RecordBatchJSONReaderImpl to pass variadicBufferCounts to loader

Implements viewDataFromJSON() to convert JSON view objects into 16-byte view structs required by the Arrow view format. The JSON VIEWS field contains objects with structure: - Inline views (≤12 bytes): {SIZE, INLINED} - Out-of-line views (>12 bytes): {SIZE, PREFIX_HEX, BUFFER_INDEX, OFFSET} This function converts these to the binary view struct layout: [size: i32, prefix/inlined: 12 bytes, buffer_index: i32, offset: i32] Changes: - Added viewDataFromJSON() helper function - Updated JSONVectorLoader.readData() to handle BinaryView and Utf8View types - Properly constructs 16-byte view structs from JSON representation

…riter) Implements JSON writing for BinaryView and Utf8View types to enable 'JS producing' integration tests. This completes the JSON format support for view types. Implementation: - Added visitBinaryView() and visitUtf8View() methods to JSONVectorAssembler - Implemented viewDataToJSON() helper that converts 16-byte view structs to JSON - Handles both inline (≤12 bytes) and out-of-line (>12 bytes) views - Properly maps variadic buffer indices and converts buffers to hex strings JSON output format matches Apache Arrow spec: - Inline views: {SIZE, INLINED} where INLINED is hex (BinaryView) or string (Utf8View) - Out-of-line views: {SIZE, PREFIX_HEX, BUFFER_INDEX, OFFSET} - VARIADIC_DATA_BUFFERS array contains hex-encoded buffer data This enables the complete roundtrip: Builder → Data → JSON → IPC → validation

…into feat/binary-utf8-view

This fixes integration test failures for BinaryView and Utf8View types. Changes: - Fix JSONTypeAssembler to serialize BinaryView/Utf8View type metadata - Fix JSONMessageReader to include VIEWS and VARIADIC_DATA_BUFFERS in sources - Fix viewDataFromJSON to handle both hex (BinaryView) and UTF-8 (Utf8View) INLINED formats - Fix readVariadicBuffers to handle individual hex strings correctly - Fix lint error: use String.fromCodePoint() instead of String.fromCharCode() - Fix lint error: use for-of loop instead of traditional for loop - Add comprehensive unit tests for JSON round-trip serialization Root cause: The JSON format uses different representations for inline data: - BinaryView INLINED: hex string (e.g., "48656C6C6F") - Utf8View INLINED: UTF-8 string (e.g., "Hello") The reader now auto-detects the format and handles both correctly. Fixes apache#320 integration test failures

- Extract hexStringToBytes() helper function to reduce code duplication - Update readVariadicBuffers() to use helper instead of wrapping in array - Update binaryDataFromJSON() to use helper for cleaner implementation - Add comprehensive documentation explaining design matches C++ reference - Document why 'as unknown as string' cast is necessary for heterogeneous sources array - Reference Arrow C++ implementation in comments for architectural clarity

When reading BinaryView/Utf8View data, ensure the DataView length doesn't exceed available buffer bounds. This fixes 'Invalid DataView length 16' errors that occur when the underlying buffer has less than 16 bytes available at the offset position. Fixes test failures in ES5 UMD build where view data deserialization was failing with RangeError.

Fixed critical bugs preventing BinaryView and Utf8View types from working correctly in ES5 UMD builds due to Google Closure Compiler advanced optimizations. Key fixes: 1. **Property access in data.ts** (visitBinaryView/visitUtf8View): - Changed from bracket notation (props['views']) to dot notation (props.views) - Closure Compiler was mangling property names when accessed via brackets - Dot notation allows consistent property renaming throughout compilation 2. **Property access in vectorloader.ts** (viewDataFromJSON): - Changed JSON property access from dot to bracket notation - Properties like SIZE, INLINED, PREFIX_HEX, etc. come from JSON - Must use bracket notation to access raw string keys from JSON 3. **Builder flush method** (BinaryViewBuilder/Utf8ViewBuilder): - Added this.clear() call at end of flush() to reset builder state - Matches pattern used by other builders (e.g., VariableWidthBuilder) - Fixes issue where multiple flush calls would accumulate length 4. **Buffer resize strategy**: - Changed from subarray() to slice() in resizeArray function - Creates copy instead of view to prevent issues with buffer reuse - Ensures flushed buffers are independent of builder state Results: - ✅ Builder pattern works correctly - ✅ vectorFromArray creates proper BinaryView/Utf8View vectors - ✅ JSON serialization/deserialization round-trips successfully - ✅ Multiple flush cycles work correctly Remaining test failures: - 2 integration tests fail only in ES5 UMD gulp tests - These tests call makeData() directly from test code with object literals - Property names get mangled differently between test code and library code - Same tests PASS with jest (no Closure Compiler involved) - All real-world usage patterns work correctly

The integration tests were calling makeData() directly from test code, which is incompatible with Google Closure Compiler's property name mangling in UMD builds. Changed tests to use vectorFromArray() which keeps all code within the same compilation unit. All unit tests now pass in all targets (ES5, ES2015, ESNext) and all module formats (CJS, ESM, UMD). Integration tests verified locally with archery and pass successfully.

GeorgeLeePatterson · 2025-11-05T02:45:20Z

After battling the integration tests for a bit I was finally able to resolve everything fully. I added unit test coverage to the JSON side as well as the existing IPC side.

I realized quickly that separating out the write/builder side of the equation wasn't going to work with how the integration tests run, so I ended up absorbing the follow up PR that adds builders. I'll decline that PR.

But both workflows are green with the patch applied. @kou have a look and let me know if you see anything else that needs to be addressed.

src/builder/buffer.ts

src/data.ts

…y access notation - Revert buffer.ts resizeArray() to use subarray() instead of slice() for performance - Fix data.ts visitUtf8View and visitBinaryView to use dot notation in destructuring for Closure Compiler compatibility

- Finishes implementation for Utf8View and BinaryView across JSON read/write paths - Patches bugs discovered from previous commits - Ensures property access is UMD friendly - Removes ad-hocs tests and incorporates new types into existing test infrastructure All passing locally: 1. Lint checks 2. Builds across all targets 3. All unit tests against all targets 4. All bundle tests 5. Integration tests

GeorgeLeePatterson · 2025-11-06T16:26:29Z

@kou @trxcllnt @kylebarron

It's been a long, arduous journey, but it's ready for another review. There are a number of improvements even over the last review. For example, I removed the useless ad hoc tests and have now incorporated the types into the existing testing framework. The CI is green, and all PR comments have been addressed. Of course if you have additional items you'd like to address, not a problem, but other than that, I believe it's ready.

kou

+1 for the CI related part and shell scripts.

scripts/update_flatbuffers.sh

kou · 2025-11-09T07:13:48Z

scripts/update_flatbuffers.sh

+rm -f "${TMPDIR}"/{File,Schema,Message,Tensor,SparseTensor}.fbs
+


It seems that we don't need this:

Suggested change

rm -f "${TMPDIR}"/{File,Schema,Message,Tensor,SparseTensor}.fbs

I made this change in response to a previous comment. I can add back.

We shouldn't ship these files if they don't correspond to anything in the Arrow implementation

Ah, the previous comment is #320 (comment), right?

I think that this line doesn't effect anything. I think that the shipped files are controlled by generated_files: https://github.com/apache/arrow-js/pull/320/files#diff-ec4452682eb9a047fa125ddc8a28a1c8c9b7d23a010c23fbb2e8ad7360526be8R58-R67
All other files including {File,Schema,Message,Tensor,SparseTensor}.fbs are removed on cleanup without shipping: https://github.com/apache/arrow-js/pull/320/files#diff-ec4452682eb9a047fa125ddc8a28a1c8c9b7d23a010c23fbb2e8ad7360526be8R39-R42

Understood, that makes sense. I removed it but have another PR coming with some updates from the most recent comment and the CI fixes. Piecing away at this beast! Once we get everything perfect, I can retrofit everything learned to all other new type PRs as well. I appreciate the feedback, should have a new PR up in the AM.

Oh, @trxcllnt I just noticed your comment. Did you want to change the PR to address that?

GeorgeLeePatterson · 2025-11-12T11:38:22Z

If you recommend updating any docs in response, let me know. flatc changed their format a bit since the last time the docs were written, but the script demonstrates that so it may not be required.

src/builder/utf8view.ts

GeorgeLeePatterson · 2025-11-13T21:36:58Z

@kou @trxcllnt @kylebarron Latest changes up, the CI is green, comments addressed. Have a look and lmk if anything else might need updating.

src/builder/utf8view.ts

Co-authored-by: Paul Taylor <178183+trxcllnt@users.noreply.github.com>

GeorgeLeePatterson added 2 commits October 29, 2025 19:28

WIP: add binaryview and uft8view support

a35ea1e

GeorgeLeePatterson force-pushed the feat/binary-utf8-view branch from 0e57715 to 5c5640a Compare October 31, 2025 20:17

GeorgeLeePatterson commented Oct 31, 2025

View reviewed changes

src/vector.ts Outdated Show resolved Hide resolved

kou requested a review from domoritz October 31, 2025 22:11

kou reviewed Oct 31, 2025

View reviewed changes

scripts/update_flatbuffers.sh Show resolved Hide resolved

kou requested a review from trxcllnt October 31, 2025 22:13

GeorgeLeePatterson added 2 commits October 31, 2025 20:12

Add Apache license headers to fix RAT check

675b2f2

Fix Jest dynamic import errors by removing moduleResolution: NodeNext…

73bda86

… from test tsconfig

GeorgeLeePatterson requested a review from kou November 1, 2025 14:24

GeorgeLeePatterson added 6 commits November 1, 2025 11:20

chore: Trigger CI validation on fork

456f85d

fix: Add new files to RAT exclusion list

dfe9d56

Add scripts/update_flatbuffers.sh and test/unit/ipc/view-types-tests.ts to RAT (Release Audit Tool) exclusion list. Both files have proper Apache license headers but need to be excluded from license scanning.

Revert "fix: Add new files to RAT exclusion list"

21a778f

This reverts commit dfe9d56.

fix: Correct license header format in update_flatbuffers.sh

e9d180b

Remove blank line after shebang to match Apache Arrow JS convention. License header must start on line 2 with '#' as shown in ci/scripts/build.sh

fix: Export BinaryView and Utf8View types

8d5bf77

Add BinaryView and Utf8View to main exports in Arrow.ts. These types were implemented but not exported, causing 'BinaryView is not a constructor' errors in ES5 UMD tests.

fix: Export BinaryView and Utf8View in Arrow.dom.ts

41f2d3e

Add BinaryView and Utf8View to Arrow.dom.ts exports. Arrow.node.ts re-exports from Arrow.dom.ts, so this fixes both entrypoints.

This was referenced Nov 1, 2025

feat: Add BinaryViewBuilder and Utf8ViewBuilder for write support #322

Closed

feat: Add support for ListView and LargeListView types #323

Draft

feat: Add ListViewBuilder and LargeListViewBuilder for write support #324

Draft