This PR moves the `Table` class out of the Vector hierarchy and adds optimized dataframe operations to it. It currently implements an optimized `scan()` method, `filter(predicate)`, `count()`, and `countBy(column_name)` (which only works on dictionary-encoded columns).

Some usage examples, based on the file generated by `js/test/data/tables/generate.py`:

``` js
> let table = Table.from(...);
> table.count()
1000000
> table.filter(col('lat').gteq(0)).count()
499718
> table.countBy('origin').toJSON()
{ Charlottesville: 166839,
  'New York': 166251,
  'San Francisco': 166642,
  Seattle: 166659,
  'Terre Haute': 166756,
  'Washington, DC': 166853 }
> table.filter(col('lng').gteq(0)).countBy('origin').toJSON()
{ Charlottesville: 83109,
  'New York': 83221,
  'San Francisco': 83515,
  Seattle: 83362,
  'Terre Haute': 83314,
  'Washington, DC': 83479 }
```

There are performance tests for the dataframe operations. To run them, you must first generate the test data by running `npm run create:perfdata`.

The PR also includes @trxcllnt's refactor of the JS implementation to make it more closely resemble the C++ implementation.
This refactor resolves multiple JIRAs: ARROW-1903, ARROW-1898, ARROW-1502, ARROW-1952 (partially), and ARROW-1985.

Author: Paul Taylor <paul.e.taylor@me.com>
Author: Brian Hulette <brian.hulette@ccri.com>
Author: Brian Hulette <hulettbh@gmail.com>

Closes #1482 from TheNeuralBit/table-scan-perf and squashes the following commits:

52f1e0e [Brian Hulette] <, > are not commutative, misc cleanup
04b1838 [Brian Hulette] even more table tests
16b9ccb [Brian Hulette] Merge pull request #4 from trxcllnt/js-cpp-refactor
fe300df [Paul Taylor] fix closure es5/umd toString() iterator
3d5240a [Paul Taylor] fix more externs
10c48ad [Paul Taylor] Merge branch 'table-scan-perf' of github.com:ccri/arrow into js-cpp-refactor
dbe7f81 [Brian Hulette] Add more Table unit tests
1910962 [Brian Hulette] Add optional bind callback to scan
5bdf17f [Brian Hulette] Fix perf
8cf2473 [Brian Hulette] Merge remote-tracking branch 'origin/master' into table-scan-perf
4a41b18 [Paul Taylor] add src/predicate to the list of exports we should save from uglify
5a91fab [Paul Taylor] add more view, predicate externs
f6adfb3 [Brian Hulette] Create predicate namespace
f7bb0ed [Paul Taylor] Merge branch 'table-scan-perf' of github.com:ccri/arrow into js-cpp-refactor
e148ee4 [Paul Taylor] Merge branch 'extern-woes' into js-cpp-refactor
25cdc4a [Paul Taylor] add src/predicate to the list of exports we should save from uglify
dc7c728 [Paul Taylor] add more view, predicate externs
25e6af7 [Brian Hulette] Create predicate namespace
579ab1f [Brian Hulette] Merge pull request #2 from trxcllnt/js-cpp-refactor
f3cde1a [Paul Taylor] fix lint
9769773 [Paul Taylor] fix vector perf tests
016ba78 [Brian Hulette] Merge pull request #1 from trxcllnt/js-cpp-refactor
272d293 [Paul Taylor] Merge pull request #4 from ccri/empty-table
7bc7363 [Brian Hulette] Fix exception for empty Table
8ddce0a [Paul Taylor] check bounds in getChildAt(i) to avoid NPEs
f1dead0 [Paul Taylor] compute chunked nested childData list correctly
18807c6 [Paul Taylor] rename ChunkData's fields so it's more clear they're not semantically similar to other similarly named fields
7e43b78 [Paul Taylor] add test:integration npm script
a5f200f [Paul Taylor] Merge pull request #3 from ccri/table-from-struct
c8cd286 [Brian Hulette] Add Table.fromStruct
a00415e [Brian Hulette] Fix perf
54d4f5b [Paul Taylor] lazily allocate table and recordbatch columns, support NestedView's getChildAt(i) method in ChunkedView
40b3638 [Paul Taylor] run integration tests with local data for coverage stats
fe31ee0 [Paul Taylor] slice the flat data values before returning an iterator of them
e537789 [Paul Taylor] make it easier to run all integration tests from local data
c0fd2f9 [Paul Taylor] use the dictionary of the last chunked vector list for chunked dictionary vectors
e33c068 [Paul Taylor] Merge pull request #2 from ccri/fixed-size-list
5bb63af [Brian Hulette] Don't read OFFSET vector for FixedSizeList
614b688 [Paul Taylor] add asEpochMs to date and timestamp vectors
87334a5 [Paul Taylor] Merge branch 'table-scan-perf' of github.com:ccri/arrow into js-cpp-refactor
b7f5bfb [Paul Taylor] rename numRows to length, add table.getColumn()
e81082f [Paul Taylor] export vector views, allow cloning data as another type
700a47c [Paul Taylor] export visitors
e859e13 [Paul Taylor] fix package.json bin entry
0620cfd [Brian Hulette] use Math.fround
0126dc4 [Brian Hulette] Don't recompute total length
e761eee [Brian Hulette] Rename asJSON to toJSON
6c91ed4 [Paul Taylor] Merge branch 'master' of github.com:apache/arrow into js-cpp-refactor-merge_with-table-scan-perf
d2b18d5 [Paul Taylor] Merge remote-tracking branch 'ccri/table-scan-perf' into js-cpp-refactor-merge_with-table-scan-perf
f3f3b86 [Paul Taylor] rename table.ts to recordbatch.ts in preparation for merging latest
e3f629d [Paul Taylor] fix rest of the mangling issues
fa7c17a [Paul Taylor] passing all tests except es5 umd mangler ones
e20decd [Brian Hulette] Add license headers
edcbdbe [Brian Hulette] cleanup
20717d5 [Brian Hulette] Fixed countBy(string)
7244887 [Brian Hulette] Add table unit tests...
6719147 [Brian Hulette] Add DataFrame.countBy operation
2f4a349 [Brian Hulette] Minor tweaks
2e118ab [Brian Hulette] linter
a788db3 [Brian Hulette] Cleanup
a9fff89 [Brian Hulette] Move Table out of the Vector hierarchy
1d60aa1 [Brian Hulette] Moved DataFrame ops to Table. DataFrame is now an interface
e8979ba [Brian Hulette] Refactor DataFrame to extend Vector<StructRow>
6a41d68 [Brian Hulette] clean up table benchmarks
2744c63 [Brian Hulette] Remove Chunked/Simple DataFrame distinction
aa999f8 [Brian Hulette] Add DictionaryVector optimization for equals predicate
4d9e8c0 [Brian Hulette] Add concept of predicates for filtering dataframes
796f45d [Brian Hulette] add DataFrame filter and count ops
30f0330 [Brian Hulette] Add basic DataFrame impl ...
a1edac2 [Brian Hulette] Add perf tests for table scans
d18d915 [Paul Taylor] fix struct and map rows
61dc699 [Paul Taylor] WIP -- refactor types to closer match arrow-cpp
62db338 [Paul Taylor] update dependencies and add es6+ umd targets to jest transform ignore patterns to fix ci
6ff18e9 [Paul Taylor] ship es2015 commonJS in main package to avoid confusion
74e828a [Paul Taylor] fix typings issues (ARROW-1903)
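
As a rough illustration of why `countBy(column_name)` is limited to dictionary-encoded columns, here is a minimal, self-contained sketch in plain JavaScript. It is not the Arrow JS implementation: `MiniTable`, `dictionaryEncode`, and this `col` helper are hypothetical stand-ins that only mimic the call shape from the usage examples in the description. The idea it assumes is that a dictionary-encoded column stores small integer indices, so counting can increment one slot of a typed array per row and map the slots back to values once at the end.

```javascript
// Hypothetical stand-ins for illustration only; none of this is the Arrow JS API.

// Dictionary encoding: store each distinct value once, and represent
// every row as an integer index into that dictionary.
function dictionaryEncode(values) {
  const dictionary = [];
  const lookup = new Map();
  const indices = new Int32Array(values.length);
  values.forEach((value, row) => {
    if (!lookup.has(value)) {
      lookup.set(value, dictionary.length);
      dictionary.push(value);
    }
    indices[row] = lookup.get(value);
  });
  return { dictionary, indices };
}

// Predicate builder with the same call shape as col('lat').gteq(0).
// The returned predicate is evaluated per row against plain array columns.
function col(name) {
  return {
    gteq: (bound) => (columns, row) => columns[name][row] >= bound,
  };
}

class MiniTable {
  constructor(columns, length, predicate = null) {
    this.columns = columns;
    this.length = length;
    this.predicate = predicate;
  }

  // filter() is lazy: it only records the predicate (AND-ed with any
  // earlier one). Rows are tested later, during scan().
  filter(predicate) {
    const previous = this.predicate;
    const combined = previous
      ? (columns, row) => previous(columns, row) && predicate(columns, row)
      : predicate;
    return new MiniTable(this.columns, this.length, combined);
  }

  // Visit the index of every row that passes the predicate.
  scan(visit) {
    for (let row = 0; row < this.length; row++) {
      if (!this.predicate || this.predicate(this.columns, row)) visit(row);
    }
  }

  count() {
    let n = 0;
    this.scan(() => n++);
    return n;
  }

  // Count rows per distinct value: one counter slot per dictionary entry,
  // incremented by index, then mapped back to the dictionary values.
  countBy(name) {
    const { dictionary, indices } = this.columns[name];
    const counts = new Uint32Array(dictionary.length);
    this.scan((row) => counts[indices[row]]++);
    return Object.fromEntries(dictionary.map((value, i) => [value, counts[i]]));
  }
}

// Tiny made-up data set (not the generated test file).
const table = new MiniTable(
  {
    lat: [10, -5, 0, 42, -1],
    origin: dictionaryEncode(['Seattle', 'New York', 'Seattle', 'New York', 'Seattle']),
  },
  5
);

console.log(table.count());
console.log(table.filter(col('lat').gteq(0)).count());
console.log(table.countBy('origin'));
console.log(table.filter(col('lat').gteq(0)).countBy('origin'));
```

In this sketch a non-dictionary column has no `indices` array to count against, which is one plausible reason for the restriction noted in the description; the real implementation may differ in its details.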