Skip to content

Conversation

@supermacro
Copy link

@supermacro supermacro commented Oct 2, 2025

🤖 Generated with Claude Code


Summary

This PR implements full support for the LargeList data type in Apache Arrow JavaScript bindings. LargeList uses 64-bit offsets (BigInt64Array) instead of 32-bit offsets, enabling list values larger than 2GB.

Related Issues

Closes #70

Implementation Details

Core Type System

  • Added LargeList enum value (Type.LargeList = 21)
  • Implemented LargeList<T> class with BigInt64Array offset support
  • Added isLargeList() type guard
  • Implemented LargeListDataProps interface for data construction

Visitor Pattern Implementation

Implemented visitLargeList() methods across all visitors:

  • GetVisitor (visitor/get.ts): Handles vector value retrieval with BigInt to number conversion
  • IteratorVisitor (visitor/iterator.ts): Enables iteration support via Symbol.iterator
  • IndexOfVisitor (visitor/indexof.ts): Supports indexOf() and includes() operations
  • TypeAssembler (visitor/typeassembler.ts): FlatBuffers serialization for IPC
  • JSONTypeAssembler (visitor/jsontypeassembler.ts): JSON serialization

IPC Support

  • Added LargeList case in ipc/metadata/message.ts for reading IPC metadata

Testing

  • Implemented comprehensive test data generation in test/generate-test-data.ts
    • Added generateLargeList() function with proper BigInt64Array offset generation
    • Fixed createVariableWidthOffsets64() to handle float-to-BigInt conversion
  • Added LargeList to test suite in test/unit/generated-data-tests.ts

Public API

  • Exported LargeList in src/Arrow.ts and src/Arrow.dom.ts

Test Plan

All existing tests continue to pass, plus new comprehensive LargeList tests validate:

  • ✅ Type creation and properties
  • ✅ Data construction with makeData()
  • ✅ Vector creation and value retrieval
  • ✅ Empty lists and null value handling
  • ✅ Iteration via [Symbol.iterator]
  • indexOf() and includes() operations
  • ✅ Vector slicing and concatenation

Run tests with:

npm test -- -t src

Checklist

  • Implementation follows existing code patterns
  • All visitor methods implemented
  • IPC serialization/deserialization support added
  • Comprehensive tests added using existing test framework
  • All tests passing
  • Public API exports added
  • No breaking changes

Notes

  • This implementation focuses on reading/writing LargeList data from IPC format
  • Builder class for LargeList construction is not included (can be added in future PR if needed)
  • Implementation mirrors existing List support but with 64-bit offsets
Implement full support for the LargeList data type, which uses 64-bit offsets (BigInt64Array) instead of 32-bit offsets, enabling list values larger than 2GB. Changes include: - Add LargeList type class with BigInt64Array offset support - Implement visitor pattern methods across all visitors (get, iterator, indexof, typeassembler, jsontypeassembler) - Add IPC serialization/deserialization support - Add comprehensive test suite using existing test generation framework - Export LargeList in public API The implementation follows existing code patterns and all tests pass. Closes apache#70 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
@kou
Copy link
Member

kou commented Oct 2, 2025

Could you read https://www.apache.org/legal/generative-tooling.html carefully and share what you ensure?

@supermacro
Copy link
Author

I've read the doc and can assure that the contribution meets the guidelines. Thankfully I only used Anthropic's claude code.

Anthropic's Commercial Terms explicitly state: "Customer (a) retains all rights to its Inputs, and (b) owns its Outputs." Anthropic assigns all rights in outputs to the customer with no licensing restrictions on how outputs may be used or distributed.

As for point 2 in the doc, all three conditions are satisfied:

2.1 - Output is not copyrightable: The implementation is functional code that:

  • Implements a published specification (Apache Arrow LargeList format)
  • Follows mandatory patterns dictated by the existing arrow-js architecture
  • Contains minimal creative expression beyond what's required by the specification
  • Mirrors the existing List implementation with BigInt64Array instead of Int32Array

2.2 - No third party materials included: The code:

  • Does not contain copied code from external sources
  • Was generated based on the arrow-js codebase patterns (which are already Apache 2.0 licensed)
  • Implements only the Arrow specification

2.3 - Third party materials used with permission: N/A - no third party materials were included

.. For point 3. Reasonable certainty obtained through:

  • Pattern matching: The implementation exclusively follows patterns from the existing arrow-js codebase (e.g., generateList → generateLargeList, visitList → visitLargeList)
  • Specification-driven: Code implements the well-defined Arrow LargeList specification, leaving minimal room for copying external implementations
  • Testing: All tests pass using the project's existing test framework
  • Code review: The implementation can be verified to contain only specification-mandated logic and arrow-js patterns
@vimota
Copy link

vimota commented Oct 4, 2025

Agreed that based on the code changes, it's extremely unlikely to violate any of the guidelines set out in Apache's guidance - it conforms precisely to the existing codebase and adds very little "new" code/logic.

@kou kou requested review from Copilot, domoritz and trxcllnt October 4, 2025 06:08
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements full support for the LargeList data type in Apache Arrow JavaScript bindings. LargeList uses 64-bit offsets (BigInt64Array) instead of 32-bit offsets, enabling list values larger than 2GB.

Key Changes:

  • Added LargeList type (enum value 21) with BigInt64Array offset support
  • Implemented visitor pattern methods across all visitor classes for LargeList support
  • Added IPC serialization/deserialization support for LargeList metadata
  • Enhanced test data generation with comprehensive LargeList test coverage

Reviewed Changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/enum.ts Added LargeList = 21 enum value
src/type.ts Implemented LargeList class with BigInt64Array offsets and type guard
src/data.ts Added LargeListDataProps interface and makeData visitor support
src/visitor.ts Added visitLargeList method and type dispatch logic
src/visitor/get.ts Implemented getLargeList function with BigInt to number conversion
src/visitor/iterator.ts Added LargeList iterator support
src/visitor/indexof.ts Added LargeList indexOf operation support
src/visitor/typeassembler.ts Added FlatBuffers serialization for LargeList
src/visitor/jsontypeassembler.ts Added JSON serialization for LargeList
src/ipc/metadata/message.ts Added LargeList case in IPC metadata decoding
src/Arrow.ts Exported LargeList class
src/Arrow.dom.ts Exported LargeList class for DOM environments
test/generate-test-data.ts Added generateLargeList function and fixed BigInt conversion
test/unit/generated-data-tests.ts Added LargeList test suite

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

} else {
do {
offsets[i + 1] = offsets[i] + BigInt(Math.min(max, Math.max(min, Math.trunc(rand() * max))));
offsets[i + 1] = offsets[i] + BigInt(Math.trunc(Math.min(max, Math.max(min, Math.trunc(rand() * max)))));
Copy link

Copilot AI Oct 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nested Math.trunc() calls are redundant. The outer Math.trunc() is unnecessary since Math.min() and Math.max() already return numbers, and the inner Math.trunc() handles the rand() * max conversion.

Suggested change
offsets[i + 1] = offsets[i] + BigInt(Math.trunc(Math.min(max, Math.max(min, Math.trunc(rand() * max)))));
offsets[i + 1] = offsets[i] + BigInt(Math.min(max, Math.max(min, Math.trunc(rand() * max))));
Copilot uses AI. Check for mistakes.
@supermacro
Copy link
Author

Hey @trxcllnt @domoritz any thoughts here, pretty small pr

Copy link
Member

@domoritz domoritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to try to reduce code size increases if possible. Otherwise looks reasonable. Please make sure the tests pass. Unit tests for specific logic would be great as well.

return { values, vector: new Vector([makeData({ type, length, nullCount, nullBitmap, valueOffsets, child: childVec.data[0] })]) };
}

function generateLargeList<T extends LargeList>(this: TestDataVectorGenerator, type: T, length = 100, nullCount = Math.trunc(length * 0.2), child = this.visit(type.children[0].type, length * 3, nullCount * 3)): GeneratedVector<T> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to have a lot of overlap with the method above. Can we extract the logic into a helper so we don't blow up code size?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

4 participants