-
- Notifications
You must be signed in to change notification settings - Fork 19.2k
CLN avoid some upcasting when its not the purpose of the test #50493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLN avoid some upcasting when its not the purpose of the test #50493
Conversation
414010b to 159e4d9 Compare | df = DataFrame(np.random.randn(n, 3)).astype({0: object}) | ||
| df["dates1"] = date_range("1/1/2012", periods=n) | ||
| df["dates3"] = date_range("1/1/2014", periods=n) | ||
| df.iloc[0, 0] = pd.NaT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here, df[0] is first created of int dtype, and then df.iloc[0, 0] = pd.NaT upcasts it to object. Might as well create it of dtype object in the first place, as the purpose of this test comes in the lines below (df.query(...)
| n = 10 | ||
| df = DataFrame(np.random.randn(n, 3)) | ||
| df = DataFrame(np.random.randn(n, 3)).astype({0: object}) | ||
| df["dates1"] = date_range("1/1/2012", periods=n) | ||
| df["dates3"] = date_range("1/1/2014", periods=n) | ||
| df.iloc[0, 0] = pd.NaT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
| # GH 9201 | ||
| df1 = DataFrame(np.random.randn(5, 3), columns=["foo", "bar", "baz"]) | ||
| df1 = DataFrame(np.random.randn(5, 3), columns=["foo", "bar", "baz"]).astype( | ||
| {"foo": object} | ||
| ) | ||
| # set one entry to a number in str format | ||
| df1.loc[0, "foo"] = "100" | ||
| | ||
| df2 = DataFrame(np.random.randn(5, 3), columns=["foo", "bar", "baz"]) | ||
| df2 = DataFrame(np.random.randn(5, 3), columns=["foo", "bar", "baz"]).astype( | ||
| {"foo": object} | ||
| ) | ||
| # set one entry to a non-number str | ||
| df2.loc[0, "foo"] = "a" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in both df1 and df2, column 'foo' is initially created to be of dtype 'int', and then by setting an element to a string it is upcasting to object. This is done on purpose to check the numeric_only flag below
Might as well declare column 'foo' to be of dtype object from the start then
| "20130901", "20131205", freq="5D", name="Date", inclusive="left" | ||
| ), | ||
| ) | ||
| ).astype({"Buyer": object}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
column 'Buyer' is originally set to 0, and then some strings are set (thus upcasting to object). Might as well construct it as object from the start - furthermore, this is the expected dataframe, so it doesn't change what's being tested
| | ||
| df = pd.DataFrame(np.zeros((3, 3))) | ||
| df = pd.DataFrame(np.zeros((3, 3))).astype({2: object}) | ||
| df.iloc[2, 2] = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the third column df[2] is initially made up of ints, and then an element is set to ' ' (thus upcasting to object)
might as well set to object from the start, as the purpose of the test is on the line below df.replace(...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good - I would recommend adding comments as otherwise it seems easy for a contributor to see thiese tests and think the astype is unnecessary.
159e4d9 to d4ecbc8 Compare | good call, thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Generally, tests follow the pattern:
There are a few places when, whilst creating the test input, some column is created of some dtype (e.g. 'int'), and then some incompatible type is set (e.g. a string
'foo'), thus upcasting the column to dtype object. After that, this input is put through the function which the test is meant to test.Given that the upcasting isn't the purpose of these tests, it'd be cleaner to just create the input right away with object dtype.
The "avoid upcasting to object" part of #50424 does seem uncontroversial, so this would help reduce the diff when a set of PRs to address that will be raised