Description
Add a new method DataFrame.select to select columns from a DataFrame. The exact specs are still open to discussion; what follows is a draft of what the method could look like.
Basic case: select columns. Personally, I think both a list and multiple parameters via *args should be supported for convenience:
    df.select("column1", "column2")
    df.select(["column1", "column2"])

Cases to consider:
What if a provided column doesn't exist? I assume we want to raise a ValueError.
What if a column is duplicated? I assume we want to return the column twice.
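To make the basic case and the two answers above concrete, here is a minimal sketch written as a free function. The name `select` and its signature are only illustrative stand-ins for the proposed method, and the error message is an assumption, not a settled spec:

```python
import pandas as pd


def select(df: pd.DataFrame, *columns) -> pd.DataFrame:
    """Hypothetical stand-in for the proposed DataFrame.select."""
    # Accept both select(df, "a", "b") and select(df, ["a", "b"]).
    if len(columns) == 1 and isinstance(columns[0], list):
        columns = tuple(columns[0])

    # Assumed behaviour from the draft: a missing column raises ValueError.
    missing = [col for col in columns if col not in df.columns]
    if missing:
        raise ValueError(f"Columns not found: {missing}")

    # List-based indexing keeps duplicates, so a column requested twice
    # is returned twice.
    return df[list(columns)]


df = pd.DataFrame({"column1": [1, 2], "column2": [3, 4], "other": [5, 6]})
both_styles_agree = select(df, "column1", "column2").equals(
    select(df, ["column1", "column2"])
)
```

Note that plain `df[["missing"]]` raises a KeyError today, so raising ValueError here would be a deliberate difference from existing indexing.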
How to select with a wildcard or regex? Some options:
1. Not support them (users can do anything fancy with df.columns themselves).
2. Assume the column is a regex if the name starts with ^ and ends with $. For wildcards, I guess it could be OK, if column* is provided, to first check whether a column with the star in its name exists; if it does, return it, otherwise assume the star is a wildcard.
3. Accept callables, so users can do df.select(lambda col: col.startswith("column")).
4. Have extra parameters like regex, e.g. df.select(regex="column\d").
5. Same as 2, but make users enable it explicitly with a flag: df.select("column\d", regex=True).
Personally, I'd start with 1, not supporting anything fancy, and decide later. It's way easier to add something than to remove something we don't like once it's released.
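For comparison, options 3 and 5 could be prototyped roughly as below. This is only a sketch: `select_matching` is a hypothetical helper, not an existing pandas API, and using a full match rather than a search for the regex is an assumption made here for illustration.

```python
import re

import pandas as pd


def select_matching(df: pd.DataFrame, selector, regex: bool = False) -> pd.DataFrame:
    """Hypothetical helper illustrating options 3 and 5."""
    if callable(selector):
        # Option 3: keep the columns for which the callable returns True.
        mask = [bool(selector(col)) for col in df.columns]
        return df.loc[:, mask]
    if regex:
        # Option 5: the string is treated as a regex only when the flag is set.
        pattern = re.compile(selector)
        mask = [pattern.fullmatch(str(col)) is not None for col in df.columns]
        return df.loc[:, mask]
    # Default: the string is a literal column name.
    return df.loc[:, [selector]]


df = pd.DataFrame({"column1": [1], "column2": [2], "other": [3]})
by_callable = select_matching(df, lambda col: col.startswith("column"))
by_regex = select_matching(df, r"column\d", regex=True)
```

The pattern is written as a raw string (`r"column\d"`) because `"\d"` in a regular string literal is an invalid escape sequence and triggers a warning in recent Python versions. Regex-based column selection is also already available through `DataFrame.filter(regex=...)`.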
What to do with MultiIndex? I guess if a list of strings is provided, they should select from the first level of the MultiIndex. Should we support the elements being tuples to select multiple levels at once? I haven't worked much with MultiIndex myself for a while. @Dr-Irv, maybe you have an idea of what the expectation should be.
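As a possible baseline for the MultiIndex question, this is how existing `[]` indexing treats the two kinds of input. The column labels are made up for illustration, and whether DataFrame.select should mirror this behaviour is still an open point:

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples([("a", "x"), ("a", "y"), ("b", "x")])
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=columns)

# A list of strings selects on the first level and keeps both levels.
first_level = df[["a"]]

# A list of tuples selects individual columns across levels.
by_tuple = df[[("a", "y"), ("b", "x")]]
```

Here `first_level` contains the columns `("a", "x")` and `("a", "y")`, while `by_tuple` contains exactly the two requested tuples.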
Can anyone think of anything else that would be non-trivial when implementing this?