Bugfix: Filesystem object not passed to pyarrow reader #2527
Conversation
Thanks @fjetter -- I started a bit of refactoring recently to enable the Dask filesystems (s3fs, gcsfs, etc.) to be used in This patch helps triage use in Dask right now, but the main thing that needs to change is to use the
dask/bytes/local.py Outdated
```python
@property
def pathsep(self):
    return '/'
```
Perhaps os.path.sep?
LocalFileSystem.sep is exactly that.
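As a sketch of the suggestion above (the class and property names here are illustrative only, not dask's actual implementation):

```python
import os

class LocalFileSystemSketch:
    """Hypothetical local filesystem wrapper, for illustration only."""

    @property
    def pathsep(self):
        # os.path.sep yields '\\' on Windows and '/' on POSIX, so it is
        # more portable than hard-coding '/'.
        return os.path.sep
```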
In favour of using walk (i.e., produce a set of files within a location - not the python iterator) as, indeed, "isdir" is ill-defined for s3 and gcs (which both happen to have
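To illustrate why "isdir" is ill-defined on object stores: S3 and GCS only have flat keys, so a "directory" check is usually emulated by asking whether any key lives under a prefix. A minimal sketch (the store is just a list of keys here, not a real S3 client; the function name is invented):

```python
def isdir_on_keys(keys, path):
    # On a flat key/value store a path "is a directory" only in the
    # sense that some key starts with the path plus a separator.
    prefix = path.rstrip('/') + '/'
    return any(key.startswith(prefix) for key in keys)

keys = ['bucket/data/part.0.parquet',
        'bucket/data/part.1.parquet',
        'bucket/top.csv']
isdir_on_keys(keys, 'bucket/data')     # True: keys exist under the prefix
isdir_on_keys(keys, 'bucket/top.csv')  # False: plain object, no children
```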
I agree that
If adding isdir/isfile to Dask's local FS enables directory reads in Dask to work, then that is helpful for testing. I will see if I can get a patch up for ARROW-1213 today or tomorrow -- it will miss the 0.5.0 release, but I suspect we have some more testing and hardening to do to make sure that we have all the edge cases for S3 / non-local filesystems sorted out. If you could help with "kicking the tires", that would be much appreciated. We're looking to do 0.6.0, with any luck within a few weeks. cc @yackoa who was also asking me about this
Sure, once you've got a working prototype I should be able to try it out, at least for S3.
@fjetter this should be working in Arrow 0.6.0 (release candidate for this should be out next week) -- a version of this patch is necessary to pass the filesystem to
That's a bummer. I never got any feedback on my comment a few weeks ago, so it would have been nice to fix this before 0.6.0 went out. I need some help with integration testing with Dask (a Dockerfile in the Arrow codebase would be sufficient) so that we can keep this integration stable and working. 0.7.0 is likely ~2 weeks away, so let's resolve this before then.
Tests should become green once apache/arrow#1032 is merged and released. I encountered two issues, though.

Sorry for taking so much time to fix this!
If the schema metadata is being written and read properly, you should not have to set the pandas index explicitly. So we should track down the origin of the bug. I think @jreback has also done some integration testing of this in pandas
…ed to ParquetDataset

This way, the `ParquetDataset` accepts both `S3FileSystem` and `LocalFileSystem` objects as they are used in `dask`. By using `issubclass`, external libraries may write their own FS wrappers by inheriting from the arrow FS.

I tested the integration with dask and this will fix the issue blocking dask/dask#2527

Author: fjetter <florian.jetter@blue-yonder.com>

Closes apache#1032 from fjetter/ARROW-1417 and squashes the following commits:

75f18a5 [fjetter] Remove isinstance check in _ensure_filesystem
302b644 [fjetter] Perform check for type object before issubclass
ed111c9 [fjetter] Allow more generic filesystems to be passed
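The subclass-based check described in the commit message above can be sketched roughly like this (the names `FileSystem`, `ensure_filesystem`, and `DaskStyleLocalFS` are placeholders, not arrow's actual internals):

```python
class FileSystem:
    """Stand-in for the arrow filesystem base class."""

def ensure_filesystem(fs):
    # Checking with issubclass on the object's type (rather than
    # requiring an exact type) lets external libraries pass their own
    # wrappers, as long as they inherit from the base class.
    if not issubclass(type(fs), FileSystem):
        raise TypeError("expected a FileSystem, got %r" % type(fs).__name__)
    return fs

class DaskStyleLocalFS(FileSystem):
    """An external wrapper that simply inherits from the base class."""

fs = ensure_filesystem(DaskStyleLocalFS())  # accepted, no error raised
```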
After #2822 was merged this reduces to a minimal fix. The tests seem to break for the same reason master is red
cc @mrocklin anything missing?

@martindurant, any thoughts? I'll merge this later today otherwise.

OK with me
jcrist left a comment
The fix looks fine, but this whole test file seems to be skipped on travis (not in the logs, and the test as written would fail). I suspect it's the pip install with --no-deps in continuous_integration/travis/install.sh that's the cause. Moving boto3 and moto to install from conda-forge should fix this.
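The suggested CI change would amount to something like the following (illustrative only; the real file is `continuous_integration/travis/install.sh`, and the exact surrounding commands are an assumption):

```shell
# Install the test-only dependencies from conda-forge so they are
# present even though the package itself is pip-installed with
# --no-deps (which skips dependency resolution entirely).
conda install -y -c conda-forge boto3 moto
pip install --no-deps -e .
```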
dask/bytes/tests/test_s3.py Outdated
```python
def test_parquet(s3):
@pytest.mark.parametrize("engine", [
    'auto',
    'pyarrow',
```
In the current version the dask engine keyword is 'arrow', not 'pyarrow'. This is fixed in #2868, but for now you'll need to change the engine name.
PR: #2875
I pushed a fix for the test here. Once tests pass, I will merge. Thanks @fjetter.
Also clean up a few imports
This fixes a bug where the `FileSystem` object is not properly passed to the `read_parquet` method. I had to add additional methods to the `LocalFileSystem` class since `pyarrow` expects a broader interface. These adaptations should eventually also be included in s3fs to support `pyarrow` reads from S3.
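A rough sketch of what such a broader interface might look like (the exact method set `pyarrow` probes is an assumption here, and this is not dask's actual class):

```python
import os

class LocalFSForParquet:
    """Illustrative local filesystem exposing the predicates a parquet
    dataset reader typically needs: isdir, isfile, and open."""

    sep = os.path.sep

    def isdir(self, path):
        # Real directories exist locally, unlike on object stores.
        return os.path.isdir(path)

    def isfile(self, path):
        return os.path.isfile(path)

    def open(self, path, mode='rb'):
        # Binary mode by default, since parquet files are binary.
        return open(path, mode)
```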