
Conversation

@jadewang-db
Contributor

Summary

This PR adds support for accessing raw Arrow IPC (Inter-Process Communication) streams
directly, providing an alternative to the existing parsed Arrow record interface. This
allows users to work with the Arrow binary format without the overhead of parsing,
enabling use cases like streaming data to other systems or custom processing pipelines.

Changes

  • Refactored internal batch iterator to expose IPC streams:

    • Renamed BatchIterator → IPCStreamIterator (returns io.Reader)
    • Updated implementations to return raw IPC streams instead of parsed batches
    • Created backward-compatible BatchIterator wrapper
  • Added public API in rows package:

    • New GetArrowIPCStreams() method on Rows interface
    • New ArrowIPCStreamIterator interface with methods:
      • Next() (io.Reader, error) - returns raw IPC stream
      • HasNext() bool
      • Close()
      • SchemaBytes() ([]byte, error)
  • Updated all row scanner implementations to support IPC streams

  • Added example demonstrating IPC stream usage

Benefits

  • Performance: Avoid parsing overhead when forwarding Arrow data
  • Flexibility: Direct access to Arrow binary format for custom processing
  • Compatibility: Easier integration with other Arrow-based systems
  • Memory efficiency: Process streams without loading all records into memory

Testing

  • All existing tests pass
  • Backward compatibility maintained through wrapper pattern
  • Example provided in examples/ipcstreams/

Usage Example

```go
rows, err := db.QueryContext(ctx, "SELECT * FROM table")

ipcStreams, err := rows.(dbsqlrows.Rows).GetArrowIPCStreams(ctx)
defer ipcStreams.Close()

for ipcStreams.HasNext() {
    reader, err := ipcStreams.Next()
    // Process raw Arrow IPC stream
}
```
jackyhu-db previously approved these changes Jul 23, 2025
@jadewang-db jadewang-db force-pushed the create-ipc-iterator branch from af7e698 to 73e8594 Compare July 24, 2025 21:07
@jadewang-db
Contributor Author

jenkins merge

@jadewang-db jadewang-db merged commit 7bde0dc into databricks:main Jul 24, 2025
2 of 3 checks passed
