benbovy
Member

@benbovy benbovy commented Sep 7, 2022

This PR hopefully improves how labels provided for multi-index level coordinates are handled in .sel().

More specifically, slices are handled in a cleaner way and it is now allowed to provide array-like labels.

PandasMultiIndex.sel() relies on the underlying pandas.MultiIndex methods like this:

  • use get_loc when a scalar label is provided for every level (no slice, no array)
    • always drops the index and returns scalar coordinates for each multi-index level
  • use get_loc_level when scalar labels are provided for only a subset of the levels
    • may collapse one or more levels of the multi-index (dropped levels result in scalar coordinates)
    • if only one level remains: renames the dimension and the corresponding dimension coordinate
  • use get_locs for all other cases.
    • always keeps the multi-index and its coordinates (even if only one item or one level is selected)

This yields a predictable behavior: as soon as one of the provided labels is a slice or array-like, the multi-index and all its level coordinates are kept in the result.
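The dispatch rule above can be sketched with plain pandas. This is only an illustration, not the code in this PR: the helper name `locate` and its return convention are assumptions made for the example, and it ignores everything xarray does afterwards (dropping levels, renaming dimensions, building coordinates).

```python
import pandas as pd


def locate(midx, labels):
    """Hypothetical helper: pick a pandas.MultiIndex lookup method.

    `labels` maps level names to a scalar, a slice or an array-like.
    Returns (method_name, indexer).
    """
    all_scalar = all(pd.api.types.is_scalar(v) for v in labels.values())

    if all_scalar and len(labels) == midx.nlevels:
        # A scalar label for every level: exact lookup of one full key.
        key = tuple(labels[name] for name in midx.names)
        return "get_loc", midx.get_loc(key)

    if all_scalar:
        # Scalar labels for a subset of the levels only.
        indexer, _remaining = midx.get_loc_level(
            tuple(labels.values()), level=tuple(labels.keys())
        )
        return "get_loc_level", indexer

    # At least one slice or array-like: levels without a label select everything.
    key = tuple(labels.get(name, slice(None)) for name in midx.names)
    return "get_locs", midx.get_locs(key)


midx = pd.MultiIndex.from_product([list("abc"), range(4)], names=("one", "two"))

for labels in (
    {"one": "a", "two": 0},
    {"one": "a"},
    {"one": slice("a", "b")},
    {"one": ["b", "c"], "two": [0, 2]},
):
    method, indexer = locate(midx, labels)
    print(labels, "->", method)
```

Only the last branch receives slices or arrays, which is why the multi-index is always kept in that case.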

Some cases illustrated below (I compare this PR with an older release due to the errors reported in #6838):

    import xarray as xr
    import pandas as pd

    midx = pd.MultiIndex.from_product([list("abc"), range(4)], names=("one", "two"))
    ds = xr.Dataset(coords={"x": midx})
    # <xarray.Dataset>
    # Dimensions:  (x: 12)
    # Coordinates:
    #   * x        (x) object MultiIndex
    #   * one      (x) object 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c'
    #   * two      (x) int64 0 1 2 3 0 1 2 3 0 1 2 3
    # Data variables:
    #     *empty*

    ds.sel(one="a", two=0)
    # this PR
    #
    # <xarray.Dataset>
    # Dimensions:  ()
    # Coordinates:
    #     x        object ('a', 0)
    #     one      <U1 'a'
    #     two      int64 0
    # Data variables:
    #     *empty*
    #
    # v2022.3.0
    #
    # <xarray.Dataset>
    # Dimensions:  ()
    # Coordinates:
    #     x        object ('a', 0)
    # Data variables:
    #     *empty*

    ds.sel(one="a")
    # this PR
    #
    # <xarray.Dataset>
    # Dimensions:  (two: 4)
    # Coordinates:
    #   * two      (two) int64 0 1 2 3
    #     one      <U1 'a'
    # Data variables:
    #     *empty*
    #
    # v2022.3.0
    #
    # <xarray.Dataset>
    # Dimensions:  (two: 4)
    # Coordinates:
    #   * two      (two) int64 0 1 2 3
    # Data variables:
    #     *empty*

    ds.sel(one=slice("a", "b"))
    # this PR
    #
    # <xarray.Dataset>
    # Dimensions:  (x: 8)
    # Coordinates:
    #   * x        (x) object MultiIndex
    #   * one      (x) object 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b'
    #   * two      (x) int64 0 1 2 3 0 1 2 3
    # Data variables:
    #     *empty*
    #
    # v2022.3.0
    #
    # <xarray.Dataset>
    # Dimensions:  (two: 8)
    # Coordinates:
    #   * two      (two) int64 0 1 2 3 0 1 2 3
    # Data variables:
    #     *empty*

    ds.sel(one="a", two=slice(1, 1))
    # this PR
    #
    # <xarray.Dataset>
    # Dimensions:  (x: 1)
    # Coordinates:
    #   * x        (x) object MultiIndex
    #   * one      (x) object 'a'
    #   * two      (x) int64 1
    # Data variables:
    #     *empty*
    #
    # v2022.3.0
    #
    # <xarray.Dataset>
    # Dimensions:  (x: 1)
    # Coordinates:
    #   * x        (x) MultiIndex
    #   - one      (x) object 'a'
    #   - two      (x) int64 1
    # Data variables:
    #     *empty*

    ds.sel(one=["b", "c"], two=[0, 2])
    # this PR
    #
    # <xarray.Dataset>
    # Dimensions:  (x: 4)
    # Coordinates:
    #   * x        (x) object MultiIndex
    #   * one      (x) object 'b' 'b' 'c' 'c'
    #   * two      (x) int64 0 2 0 2
    # Data variables:
    #     *empty*
    #
    # v2022.3.0
    #
    # ValueError: Vectorized selection is not available along coordinate 'one' (multi-index level)
Review only the case where labels are provided for index levels.

Allow providing array-like objects as labels.

Handle slices in a cleaner way.

pandas MultiIndex methods are used like this:

- use ``pandas.MultiIndex.get_loc`` when all levels are provided with each a scalar label (no slice, no array)
- use ``pandas.MultiIndex.get_loc_level`` when only a subset of levels are provided with scalar labels
- use ``pandas.MultiIndex.get_locs`` for all other cases.
@benbovy
Member Author

benbovy commented Sep 8, 2022

> it is now allowed to provide array-like labels.

Hmm, not sure if it's a good idea... I find get_locs() a bit confusing, as in the example below, where a 4-label array for level "one" returns a 3-item integer location array:

    import numpy as np

    # is the 3rd label ("b") ignored?
    midx.get_locs((np.array(["b", "a", "b", "c"]), 0))
    # array([4, 0, 8])

That differs too much from the vectorized selection based on single pandas indexes...
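To make the difference concrete, here is a small comparison under stated assumptions: `pandas.Index.get_indexer` stands in for the single-index case (it is not necessarily the exact call xarray makes), and the flat index `flat` is made up for the example.

```python
import numpy as np
import pandas as pd

labels = np.array(["b", "a", "b", "c"])

# Single (flat) index: one integer position per requested label,
# so the repeated "b" appears twice in the result.
flat = pd.Index(["a", "b", "c"])
flat_positions = flat.get_indexer(labels)

# Multi-index: get_locs() returns each matching location only once,
# so the repeated "b" does not add an item.
midx = pd.MultiIndex.from_product([list("abc"), range(4)], names=("one", "two"))
multi_positions = midx.get_locs((labels, 0))

print(len(labels), len(flat_positions), len(multi_positions))
```

With a flat index the result has the same length as the label array, which is what vectorized selection relies on; with get_locs() it does not.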

Fancy indexing with n-d label arrays doesn't work either:

    midx.get_locs((np.array([["a", "a"], ["a", "a"]]), 0))
    # InvalidIndexError: [['a' 'a']
    #  ['a' 'a']]

And providing Variable or DataArray objects as labels would make things even harder, unless we ignore their dimension names and coordinates (but then it wouldn't be consistent with vectorized selection based on single pandas indexes).

Probably not worth it then?

@mathause
Collaborator

It would be nice to be able to preserve the MultiIndex with sel (e.g. ds.sel(one=["a"])), but if it makes the behavior inconsistent, that is no good either...
