Skip to content

Conversation

@toobaz
Copy link
Member

@toobaz toobaz commented Mar 21, 2017

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment. lgtm.

can you also add some tests for nrows/chunksize that they must be positive integers?

- ``Series`` provides a ``to_excel`` method to output Excel files (:issue:`8825`)
- The ``usecols`` argument in ``pd.read_csv`` now accepts a callable function as a value (:issue:`14154`)
- The ``skiprows`` argument in ``pd.read_csv`` now accepts a callable function as a value (:issue:`10882`)
- The ``nrows`` and ``chunksize`` arguments in ``pd.read_csv`` are no more incompatible (:issue:`15755`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no more incompatible -> are supported if both are passed.

@jreback jreback added Enhancement IO CSV read_csv, to_csv labels Mar 21, 2017
@jreback
Copy link
Contributor

jreback commented Mar 21, 2017

actually I think we need some tests if chunksize > nrows.

and with get_chunk with nrows & chunksize specified. (mostly what happens at the edge point).

@codecov
Copy link

codecov bot commented Mar 21, 2017

Codecov Report

Merging #15756 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@ Coverage Diff @@ ## master #15756 +/- ## ========================================== - Coverage 91.01% 91% -0.02%  ========================================== Files 143 143 Lines 49371 49370 -1 ========================================== - Hits 44937 44928 -9  - Misses 4434 4442 +8
Impacted Files Coverage Δ
pandas/io/parsers.py 95.51% <100%> (-0.01%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.86% <0%> (-0.1%) ⬇️
pandas/core/common.py 91.3% <0%> (+0.33%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1e753d7...d0288e3. Read the comment docs.

@toobaz
Copy link
Member Author

toobaz commented Mar 21, 2017

can you also add some tests for nrows/chunksize that they must be positive integers?

shall we cast floats (current behavior, as of #10476), or raise?

@jorisvandenbossche
Copy link
Member

can you also add some tests for nrows/chunksize that they must be positive integers?

shall we cast floats (current behavior, as of #10476), or raise?

nrows already has a validation function (which casts to float if possible), we can probably do the same for chunksize

@jorisvandenbossche
Copy link
Member

There is currently actually a memory leak/infinite loop when doing eg chunksize=0.5, so would be good to validate chunksize the same as nrows.

Negative values for nrows/chunksize currently return an empty frame, so which is kind of valid. If we want to change that, I would keep that for another PR/issue.

@jreback
Copy link
Contributor

jreback commented Mar 21, 2017

nrows already has a validation function (which casts to float if possible), we can probably do the same for chunksize

yep that's a good idea

@toobaz
Copy link
Member Author

toobaz commented Mar 21, 2017

nrows already has a validation function (which casts to float if possible), we can probably do the same for chunksize

Agree, but since chunksize=0 easily results, as @jorisvandenbossche mentioned, in infinite loops, I would impose chunksize >= 1. And maybe for (partial - asking nrows=0 is fine) coherence we want to impose nrows >= 0?

(I would like this, because negative chunksize/nrows could be misinterpreted as parsing starting from the end...)

In the meanwhile, I have pushed the requested tests for the original edit, in case we want to move the rest/discussion to another PR

@jreback
Copy link
Contributor

jreback commented Mar 21, 2017

Negative values for nrows/chunksize currently return an empty frame, so which is kind of valid. If we want to change that, I would keep that for another PR/issue.

I think this is kind of wonky. We don't actually support indexing from the end of the file. I would raise on this. (can do in another independent issue).

if size is None:
size = self.chunksize
if self.nrows is not None:
if self._currow >= self.nrows:
Copy link
Contributor

@jreback jreback Mar 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why isn't this just

if self.nrows is not None: if self._currow >= self.nrows: size = 0 size = min(....) return self.read(...) 

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raising StopIteration directly here seems odd. as self.read(nrows=0) correctly returns the result (as an empty DataFrame)

Copy link
Member Author

@toobaz toobaz Mar 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because a parser's .read() raises (or at least, can legitimately raise) StopIteration only when the current row increases above self.nrows, and this never happens if you keep asking for 0 rows at a time. This is not hypothetical - the "# With nrows" step of test_read_chunksize in pandas/tests/io/parser/common.py hangs for the PythonParser (and possibly others) if I change the code as you suggest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, ok

@jreback
Copy link
Contributor

jreback commented Mar 21, 2017

cc @gfyoung if you can have a look.

@jreback jreback added this to the 0.20.0 milestone Mar 21, 2017
@gfyoung
Copy link
Member

gfyoung commented Mar 21, 2017

LGTM!

@jreback jreback closed this in 163d18e Mar 21, 2017
@jreback
Copy link
Contributor

jreback commented Mar 21, 2017

thanks @toobaz

@toobaz
Copy link
Member Author

toobaz commented Mar 21, 2017

You're welcome!

@toobaz toobaz deleted the nrows_chunksize branch March 21, 2017 18:54
mattip pushed a commit to mattip/pandas that referenced this pull request Apr 3, 2017
closes pandas-dev#15755 Author: Pietro Battiston <me@pietrobattiston.it> Closes pandas-dev#15756 from toobaz/nrows_chunksize and squashes the following commits: d0288e3 [Pietro Battiston] ENH: support "nrows" and "chunksize" together
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Enhancement IO CSV read_csv, to_csv

4 participants