
Conversation

@fjetter
Member

@fjetter fjetter commented Jul 19, 2017

This fixes a bug where the FileSystem object is not properly passed to the read_parquet method.

I had to add additional methods to the LocalFileSystem class since pyarrow expects a broader interface. These adaptations should eventually also be included in s3fs to support pyarrow reads from s3.
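For reference, a minimal sketch of the kind of local-filesystem interface pyarrow probes. The exact method set pyarrow requires is an assumption here, and dask's actual `LocalFileSystem` differs; this is only meant to show the shape of the adapter:

```python
import os

class LocalFileSystem:
    """Toy local filesystem wrapper. The method names below (isdir,
    isfile, ls, open) reflect the kind of interface a Parquet reader
    probes; the exact set pyarrow needs is an assumption here."""

    sep = os.sep

    def isdir(self, path):
        return os.path.isdir(path)

    def isfile(self, path):
        return os.path.isfile(path)

    def ls(self, path):
        # Return full paths, since callers typically join on them downstream.
        return sorted(os.path.join(path, p) for p in os.listdir(path))

    def open(self, path, mode="rb"):
        return open(path, mode=mode)
```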


@wesm
Contributor

wesm commented Jul 20, 2017

Thanks @fjetter -- I started a bit of refactoring recently to enable the Dask filesystems (s3fs, gcsfs, etc.) to be used in ParquetDataset (see ARROW-1213).

This patch helps triage use in Dask right now, but the main thing that needs to change is to use the walk API on the filesystems rather than a combination of ls, isdir, and isfile. Some filesystems like S3 make it a little bit awkward to determine isfile or isdir and it's simpler and faster to emulate walk.
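To illustrate the point about walk: on an object store there are only keys, so a single listing call answers both the "is this a directory?" and "is this a file?" questions. A toy sketch (`FakeObjectStore` is hypothetical, not a real dask or pyarrow class):

```python
class FakeObjectStore:
    """Toy stand-in for an object store like S3: only keys exist, and
    directories are implied by key prefixes."""

    def __init__(self, keys):
        self.keys = set(keys)

    def walk(self, prefix):
        """Return all file keys under a prefix with one listing call."""
        prefix = prefix.rstrip('/')
        return sorted(k for k in self.keys
                      if k == prefix or k.startswith(prefix + '/'))

    # isfile/isdir can be emulated from the same listing, but on a real
    # store each would cost a separate listing round trip:
    def isfile(self, path):
        return path in self.keys

    def isdir(self, path):
        prefix = path.rstrip('/') + '/'
        return any(k.startswith(prefix) for k in self.keys)
```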


    @property
    def pathsep(self):
        return '/'

Perhaps os.path.sep?


LocalFileSystem.sep is exactly that.

@martindurant
Member

I'm in favour of using walk (i.e., producing a set of files within a location, not the Python iterator) since, indeed, "isdir" is ill-defined for s3 and gcs (both of which happen to have walk defined already).
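A flattened walk of that style for the local filesystem could look like the following sketch; the actual dask/s3fs walk signatures may differ:

```python
import os

def walk(path):
    """Flatten os.walk into a sorted list of file paths, in the
    'set of files within a location' style discussed above. A sketch,
    not the actual dask or s3fs implementation."""
    files = []
    for root, _dirs, names in os.walk(path):
        files.extend(os.path.join(root, n) for n in names)
    return sorted(files)
```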

@fjetter
Member Author

fjetter commented Jul 21, 2017

I agree that isdir, isfile, etc. are rather awkward for s3 and I was already wondering how one would do that.
How do you propose to continue? Add walk to the local filesystem and wait for https://issues.apache.org/jira/browse/ARROW-1213 to enable pyarrow?

@wesm
Contributor

wesm commented Jul 21, 2017

If adding isdir/isfile to Dask's local FS enables directory reads in Dask to work, then that is helpful for testing. I will see if I can get a patch up for ARROW-1213 today or tomorrow -- it will miss the 0.5.0 release but I suspect we have some more testing and hardening to do to make sure that we have all the edge cases for S3 / non-local filesystems sorted out. If you could help with "kicking the tires" that would be much appreciated

We're looking to do 0.6.0, with any luck, within a few weeks.

cc @yackoa who was also asking me about this

@fjetter fjetter force-pushed the bugfix/filesystem_passed_pyarrow branch from 33e45bf to 097eae1 Compare July 22, 2017 09:01
@fjetter
Member Author

fjetter commented Jul 22, 2017

Sure, once you've got a working prototype I should be able to try it out, at least for S3.

@wesm
Contributor

wesm commented Aug 4, 2017

@fjetter this should be working in Arrow 0.6.0 (release candidate for this should be out next week) -- a version of this patch is necessary to pass the filesystem to ParquetDataset. Can you take a look at testing this and then we can merge a patch to Dask after 0.6.0 is out?

@fjetter fjetter force-pushed the bugfix/filesystem_passed_pyarrow branch from 097eae1 to 3607162 Compare August 8, 2017 12:18
@fjetter
Member Author

fjetter commented Aug 27, 2017

pyarrow still rejects the dask LocalFileSystem since this check is too restrictive, which is why the tests are still failing. Although I haven't checked yet, I assume the DaskS3Filesystem would be rejected as well. I opened https://issues.apache.org/jira/browse/ARROW-1417 to track this issue.
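The check in question is a strict isinstance whitelist. One way to loosen such a check is duck typing; the actual fix in apache/arrow#1032 instead switched to an `issubclass`-based check so external wrappers inheriting from the arrow filesystem classes pass. A duck-typed sketch for illustration only (names are illustrative, not pyarrow's actual `_ensure_filesystem`):

```python
# Illustrative set of methods a Parquet reader needs; not pyarrow's
# actual requirements.
REQUIRED_METHODS = ('ls', 'isdir', 'isfile', 'open')

def ensure_filesystem(fs):
    """Duck-typed alternative to a strict isinstance whitelist: accept
    any object exposing the methods the reader needs. A sketch, not
    pyarrow's actual _ensure_filesystem."""
    missing = [m for m in REQUIRED_METHODS
               if not callable(getattr(fs, m, None))]
    if missing:
        raise TypeError("Filesystem-like object is missing methods: %s"
                        % missing)
    return fs
```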

@wesm
Contributor

wesm commented Aug 27, 2017

That's a bummer. I never got any feedback on my comment a few weeks ago, so it would have been nice to fix this before 0.6.0 went out. I need some help with integration testing with Dask (a Dockerfile in the Arrow codebase would be sufficient) so that we can keep this integration stable and working.

0.7.0 is likely ~2 weeks away so let's resolve this before then.

@wesm
Contributor

wesm commented Aug 27, 2017

cc @jreback @cpcloud in case you have any spare cycles to help

@fjetter
Member Author

fjetter commented Sep 3, 2017

Tests should become green once apache/arrow#1032 is merged and released.

I encountered two issues, though.

  1. I was running the tests on Python 2.7 and encountered this issue, which I could only fix by using unicode literals. This might be an actual fastparquet bug after all.
  2. Setting the index of the pandas data frame explicitly doesn't feel natural to me since I'd expect arrow to perform this action for me. I have the impression that I stumbled either upon an arrow bug or am misusing it somehow. Does anybody know which one it is?
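On point 2: arrow records which columns were the index in a 'pandas' metadata blob on the Parquet schema; when that blob round-trips, the reader can restore the index without the caller setting it explicitly. A much-simplified sketch of the mechanism (the real metadata schema carries more fields than shown here):

```python
import json

def write_pandas_metadata(index_columns):
    """Sketch of the writer side: record the index columns as a JSON
    blob under the b'pandas' schema-metadata key, as arrow does
    (simplified; the real blob has more fields)."""
    return {b'pandas': json.dumps({'index_columns': index_columns}).encode()}

def split_index(rows, metadata):
    """Sketch of the reader side: rows is a list of dicts standing in
    for a decoded table. Split each row into (index, data) using the
    recorded index_columns instead of a manual set_index."""
    meta = json.loads(metadata[b'pandas'].decode())
    index_cols = set(meta['index_columns'])
    return [({c: v for c, v in r.items() if c in index_cols},
             {c: v for c, v in r.items() if c not in index_cols})
            for r in rows]
```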

Sorry for taking so much time to fix this!

@wesm
Contributor

wesm commented Sep 3, 2017

If the schema metadata is being written and read properly, you should not have to set the pandas index explicitly. So we should track down the origin of the bug. I think @jreback has also done some integration testing of this in pandas

wesm pushed a commit to wesm/arrow that referenced this pull request Sep 4, 2017
…ed to ParquetDataset

This way, the `ParquetDataset` accepts both `S3FileSystem` and `LocalFileSystem` objects as they are used in `dask`. By using `issubclass`, external libraries may write their own FS wrappers by inheriting from the arrow FS. I tested the integration with dask and this will fix the issue blocking dask/dask#2527

Author: fjetter <florian.jetter@blue-yonder.com>

Closes apache#1032 from fjetter/ARROW-1417 and squashes the following commits:

75f18a5 [fjetter] Remove isinstance check in _ensure_filesystem
302b644 [fjetter] Perform check for type object before issubclass
ed111c9 [fjetter] Allow more generic filesystems to be passed
@fjetter fjetter force-pushed the bugfix/filesystem_passed_pyarrow branch 2 times, most recently from 90512de to 2740f44 Compare October 26, 2017 09:11
@fjetter fjetter force-pushed the bugfix/filesystem_passed_pyarrow branch from 46c95ee to fefce3a Compare November 1, 2017 21:13
@fjetter
Member Author

fjetter commented Nov 1, 2017

After #2822 was merged this reduces to a minimal fix. The tests seem to break for the same reason master is red

@fjetter
Member Author

fjetter commented Nov 9, 2017

cc @mrocklin anything missing?

@TomAugspurger
Member

@martindurant, any thoughts?

I'll merge this later today otherwise.

@martindurant
Member

OK with me


@jcrist jcrist left a comment


The fix looks fine, but this whole test file seems to be skipped on travis (not in the logs, and the test as written would fail). I suspect it's the pip install with --no-deps in continuous_integration/travis/install.sh that's the cause. Moving boto3 and moto to install from conda-forge should fix this.
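For context on why the skip can be silent: such test modules typically guard on optional imports, so when `pip install --no-deps` leaves boto3/moto uninstalled, the whole file is skipped rather than failed. A minimal sketch of the detection (in real test code, `pytest.importorskip` does the equivalent):

```python
import importlib.util

def have(module_name):
    """True if a module is importable. Test files that skip themselves
    when optional deps like boto3/moto are absent make a missing CI
    dependency easy to overlook, since nothing shows up as a failure."""
    return importlib.util.find_spec(module_name) is not None
```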

def test_parquet(s3):
@pytest.mark.parametrize("engine", [
'auto',
'pyarrow',

In the current version the dask engine keyword is 'arrow', not 'pyarrow'. This is fixed in #2868, but for now you'll need to change the engine name.
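A sketch of what accepting both spellings (as #2868 later did) might look like; the names and mapping below are illustrative, not dask's actual engine dispatch:

```python
def normalize_engine(engine):
    """Map user-facing engine spellings to backend names. At the time of
    this thread dask only accepted 'arrow'; #2868 added 'pyarrow' as the
    spelling. This dispatch is a sketch, not dask's actual code."""
    aliases = {'arrow': 'arrow',
               'pyarrow': 'arrow',
               'fastparquet': 'fastparquet'}
    try:
        return aliases[engine]
    except KeyError:
        raise ValueError("Unsupported engine: %r" % (engine,))
```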

@jcrist
Member

jcrist commented Nov 9, 2017

Moving boto3 and moto to install from conda-forge should fix this.

PR: #2875

@jcrist
Member

jcrist commented Nov 9, 2017

I pushed a fix for the test here. Once tests pass will merge. Thanks @fjetter.

@jcrist jcrist force-pushed the bugfix/filesystem_passed_pyarrow branch from 7d2303d to 5a7a0f5 Compare November 9, 2017 22:10
@jcrist jcrist merged commit 486bb08 into dask:master Nov 9, 2017
@fjetter fjetter deleted the bugfix/filesystem_passed_pyarrow branch January 30, 2025 13:01
