Fix Issue 34748 - read in datetime as MultiIndex for column headers #34954

Dr-Irv · 2020-06-23T16:04:14Z

closes BUG: read_excel issue - 2 (when "column name" is a date/time) #34748
tests added / passed
- pandas/tests/io/excel/test_readers.py, method test_read_datetime_multiindex
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

WillAyd

looks good - can you test for the other file extensions?

Dr-Irv · 2020-06-23T16:50:36Z

looks good - can you test for the other file extensions?

Just pushed a change. Couldn't figure out how to get an ODS file that would successfully read, so I put in xfail for that

WillAyd · 2020-06-24T15:25:42Z

@Dr-Irv added on ODS file to this if you want to pull locally

…sue34748

Dr-Irv · 2020-06-24T16:05:19Z

@WillAyd Thanks. Latest push merges with master and removes the xfail on ods. Passes on my Windows box....

Dr-Irv · 2020-06-24T17:28:22Z

@WillAyd I'm currently failing because the XLSB file I created via Excel can't be read by pandas using xlrd or openpyxl. Is there a way to create xlsb files for testing that pandas can read?

WillAyd · 2020-06-24T17:57:29Z

pyxlsb is the only engine that can read that format. IIRC there’s also a limitation in that tool where it can’t distinguish datetime types which may be problematic for this (can skip until that gets patched if needed)

…

Sent from my iPhone

On Jun 24, 2020, at 10:28 AM, Irv Lustig ***@***.***> wrote: @WillAyd I'm currently failing because the XLSB file I created via Excel can't be read by pandas using xlrd or openpyxl. Is there a way to create xlsb files for testing that pandas can read? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

Dr-Irv · 2020-06-24T18:13:50Z

pyxlsb is the only engine that can read that format. IIRC there’s also a limitation in that tool where it can’t distinguish datetime types which may be problematic for this (can skip until that gets patched if needed)

Thanks @WillAyd . pyxlsb not supporting datetimes was the issue.

WillAyd · 2020-06-24T19:18:46Z

ref #31215 and willtrnr/pyxlsb#10

WillAyd

Cool thanks! Implementation lgtm minor question on test

@jreback

WillAyd · 2020-06-24T19:20:35Z

pandas/tests/io/excel/test_readers.py

+ expected_column_index = pd.MultiIndex.from_tuples(
+ [(pd.to_datetime("02/29/2020"), pd.to_datetime("03/01/2020"))],
+ names=[
+ pd.to_datetime("02/29/2020").to_pydatetime(),


Are the to_pydatetime calls required?

Because when the Excel reader creates the names of the index, the types are of dt.datetime not the pandas datetime

hmm, where is this done? this is unfortunately as these should actually be Timestamp

hmm, where is this done? this is unfortunately as these should actually be Timestamp

So is this a separate issue - that we don't want the names to be dt.datetime ? If so, I will create an issue for that.

jreback

2 questions, patch and tests look good

jreback · 2020-06-24T22:38:24Z

pandas/tests/io/excel/test_readers.py

+ pd.to_datetime("03/01/2020").to_pydatetime(),
+ ],
+ )
+ # Cannot create a DataFrame with `expected_column_index` as the columns


what does this comment mean?

So there is a bug in master (but not in 1.0.5). See below. Should I create a separate issue for this?

In [1]: import pandas as pd In [2]: pd.__version__ Out[2]: '1.1.0.dev0+1952.gc744e35d0.dirty' In [3]: expected_column_index = pd.MultiIndex.from_tuples( ...: [(pd.to_datetime("02/29/2020"), pd.to_datetime("03/01/2020"))], ...: names=[ ...: pd.to_datetime("02/29/2020").to_pydatetime(), ...: pd.to_datetime("03/01/2020").to_pydatetime(), ...: ], ...: ) In [4]: pd.DataFrame([], columns=expected_column_index) --------------------------------------------------------------------------- InvalidIndexError Traceback (most recent call last) <ipython-input-4-e63e2d18b57f> in <module> ----> 1 pd.DataFrame([], columns=expected_column_index) c:\Code\pandas_dev\pandas\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy) 515 mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy) 516 else: --> 517 mgr = init_dict({}, index, columns, dtype=dtype) 518 else: 519 try: c:\Code\pandas_dev\pandas\pandas\core\internals\construction.py in init_dict(data, index, columns, dtype) 267 nan_dtype = dtype 268 val = construct_1d_arraylike_from_scalar(np.nan, len(index), nan_dtype) --> 269 arrays.loc[missing] = [val] * missing.sum() 270 271 else: c:\Code\pandas_dev\pandas\pandas\core\indexing.py in __setitem__(self, key, value) 662 else: 663 key = com.apply_if_callable(key, self.obj) --> 664 indexer = self._get_setitem_indexer(key) 665 self._has_valid_setitem_indexer(key) 666 c:\Code\pandas_dev\pandas\pandas\core\indexing.py in _get_setitem_indexer(self, key) 613 614 try: --> 615 return self._convert_to_indexer(key, axis=0, is_setter=True) 616 except TypeError as e: 617 c:\Code\pandas_dev\pandas\pandas\core\indexing.py in _convert_to_indexer(self, key, axis, is_setter) 1158 # if we are a label return me 1159 try: -> 1160 return labels.get_loc(key) 1161 except LookupError: 1162 if isinstance(key, tuple) and isinstance(labels, ABCMultiIndex): c:\Code\pandas_dev\pandas\pandas\core\indexes\multi.py in get_loc(self, key, method) 2694 if not isinstance(key, (tuple, list)): 2695 # not including list here breaks some indexing, xref #30892 -> 2696 loc = self._get_level_indexer(key, level=0) 2697 return _maybe_to_slice(loc) 2698 c:\Code\pandas_dev\pandas\pandas\core\indexes\multi.py in _get_level_indexer(self, key, level, indexer) 2959 else: 2960 -> 2961 code = self._get_loc_single_level_index(level_index, key) 2962 2963 if level > 0 or self.lexsort_depth == 0: c:\Code\pandas_dev\pandas\pandas\core\indexes\multi.py in _get_loc_single_level_index(self, level_index, key) 2628 return -1 2629 else: -> 2630 return level_index.get_loc(key) 2631 2632 def get_loc(self, key, method=None): c:\Code\pandas_dev\pandas\pandas\core\indexes\datetimes.py in get_loc(self, key, method, tolerance) 569 """ 570 if not is_scalar(key): --> 571 raise InvalidIndexError(key) 572 573 orig_key = key InvalidIndexError: 2020-02-29 00:00:00 2020-03-01 00:00:00 2020-02-29 2020-03-01 True dtype: bool

see above, i think that should work, but these should be coerced to Timestamp, we don't support datetime in Index at all (even as object), they are always coerced.

would be ok with handling this as a pre-cursor PR bug fix i think.

see above, i think that should work, but these should be coerced to Timestamp, we don't support datetime in Index at all (even as object), they are always coerced.

In my example, the index values are TimeStamp, but the Excel reader uses datetime for the names

right, how does the Excel reader do that?

would be ok with handling this as a pre-cursor PR bug fix i think.

So do you mean we hold up on this PR, create an issue for the bug above, and get that fixed first?

so if the other issue can be address first would be best. so i would create an issue. ideally we merge the patch first. if it turns into a quagmire, then can re-visit.

Dr-Irv · 2020-06-25T15:15:15Z

@jreback Summarizing the discussion above. There seem to be 2 separate things raised above:

In current master, if you use TimeStamp as a MultiIndex for the columns, there is a failure:

In [1]: import pandas as pd In [2]: pd.__version__ Out[2]: '1.1.0.dev0+1952.gc744e35d0.dirty' In [3]: expected_column_index = pd.MultiIndex.from_tuples( ...: [(pd.to_datetime("02/29/2020"), pd.to_datetime("03/01/2020"))], ...: names=[ ...: 'a', 'b']) In [4]: pd.DataFrame([], columns=expected_column_index) --------------------------------------------------------------------------- InvalidIndexError

The Excel reader (and in fact all constructors) allow the name of the column index to be a datetime.datetime:

In [1]: import pandas as pd In [2]: import datetime as dt In [3]: df=pd.DataFrame([], columns=pd.Index(["a"], name=dt.datetime(2020,6,25) ...: )) In [4]: df.columns Out[4]: Index(['a'], dtype='object', name=2020-06-25 00:00:00) In [5]: type(df.columns.name) Out[5]: datetime.datetime

It seems to me you are suggesting that I create an issue for (1) and we fix that before merging in this PR.

For (2), I suggest that we don't worry about that, or we could create an issue that says that pandas needs to coerce the names of indexes that are datetime.datetime into Timestamp

Let me know if you'd like to proceed in a different manner.

azure-pipelines · 2020-07-08T18:28:48Z

Commenter does not have sufficient privileges for PR 34954 in repo pandas-dev/pandas

WillAyd · 2020-07-08T19:18:25Z

/azp run

azure-pipelines · 2020-07-08T19:18:35Z

Azure Pipelines successfully started running 1 pipeline(s).

Dr-Irv · 2020-07-08T21:27:51Z

@jreback pinging you here, as now that I fixed the other bug, this PR should be ready for merge

jreback · 2020-07-08T21:53:59Z

@jreback pinging you here, as now that I fixed the other bug, this PR should be ready for merge

don't we need to change to use Timestamps?

Dr-Irv · 2020-07-08T21:59:11Z

@jreback pinging you here, as now that I fixed the other bug, this PR should be ready for merge

don't we need to change to use Timestamps?

See discussion above (#34954 (comment)). That's a separate issue (# 2 in that comment, which I could create if you want), which is whether or not we want to allow the names of a MultiIndex to be datetime objects, which would also entail a change to the Excel reader (and maybe elsewhere), so that a name can't be a datetime but is coerced to a Timestamp

jreback · 2020-07-08T22:04:07Z

thanks @Dr-Irv ok i get it

Fix Issue 34748 - read in datetime as MultiIndex for column headers

7143555

WillAyd reviewed Jun 23, 2020

View reviewed changes

WillAyd added the IO Excel read_excel, to_excel label Jun 23, 2020

add more xls file formats

8b63b89

Dr-Irv and others added 2 commits June 24, 2020 10:34

use testing pattern for other Excel files

1a68eef

added ods file

ab4b49d

Dr-Irv added 3 commits June 24, 2020 12:02

merge with upstream master

d3cd371

Merge branch 'issue34748' of https://github.com/Dr-Irv/pandas into is…

9f3cdb4

…sue34748

remove xfail for ods test

c930a72

skip tests on xlsb with datetimes

c744e35

WillAyd reviewed Jun 24, 2020

View reviewed changes

jreback requested changes Jun 24, 2020

View reviewed changes

jreback added the Bug label Jun 24, 2020

jreback added this to the 1.1 milestone Jun 24, 2020

Dr-Irv added 2 commits June 26, 2020 09:50

merge with master

c94a9b3

merge with master, leverage fix in PR 35140

2ceb77e

jreback approved these changes Jul 8, 2020

View reviewed changes

jreback merged commit c15f080 into pandas-dev:master Jul 8, 2020

Dr-Irv deleted the issue34748 branch February 13, 2023 20:50

Uh oh!

Fix Issue 34748 - read in datetime as MultiIndex for column headers #34954

Fix Issue 34748 - read in datetime as MultiIndex for column headers #34954

Uh oh!

Conversation

Dr-Irv commented Jun 23, 2020

WillAyd left a comment

Choose a reason for hiding this comment

Dr-Irv commented Jun 23, 2020

WillAyd commented Jun 24, 2020

Dr-Irv commented Jun 24, 2020

Dr-Irv commented Jun 24, 2020

WillAyd commented Jun 24, 2020 via email

Dr-Irv commented Jun 24, 2020

WillAyd commented Jun 24, 2020

WillAyd left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Dr-Irv Jun 25, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Dr-Irv commented Jun 25, 2020

azure-pipelines bot commented Jul 8, 2020

WillAyd commented Jul 8, 2020

azure-pipelines bot commented Jul 8, 2020

Dr-Irv commented Jul 8, 2020

jreback commented Jul 8, 2020

Dr-Irv commented Jul 8, 2020

jreback commented Jul 8, 2020

Labels

3 participants

Dr-Irv Jun 25, 2020 •

edited

Loading