Skip to content

Conversation

@arw2019
Copy link
Contributor

@arw2019 arw2019 commented Feb 5, 2021

Fixing a bug I introduced in #37355

@jreback jreback added Categorical Categorical Data Type Regression Functionality that used to work in a prior pandas version labels Feb 5, 2021
@jreback jreback added this to the 1.2.2 milestone Feb 5, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks fine. can you add a note in fixed regressions for 1.2.2 & re-run appropriate asv's to make sure not much change.

@simonjayhawkins
Copy link
Member

looks fine. can you add a note in fixed regressions for 1.2.2 & re-run appropriate asv's to make sure not much change.

@arw2019 when do you think you can do this? 1.2.2 release scheduled for today. we have a couple of blockers though.

@simonjayhawkins simonjayhawkins mentioned this pull request Feb 7, 2021
@arw2019
Copy link
Contributor Author

arw2019 commented Feb 7, 2021

looks fine. can you add a note in fixed regressions for 1.2.2 & re-run appropriate asv's to make sure not much change.

@arw2019 when do you think you can do this? 1.2.2 release scheduled for today. we have a couple of blockers though.

@simonjayhawkins this is ready to go assuming nothing comes up in benchmarks. Running them now, will post once they're done

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok to merge ,a followup comment (which can be here or in a followon for 1.3)

result = ser.astype("object").astype(CategoricalDtype())
tm.assert_series_equal(result, roundtrip_expected)

def test_categorical_int_to_int32(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be named test_categorical_astype_to_int

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

result = ser.astype("object").astype(CategoricalDtype())
tm.assert_series_equal(result, roundtrip_expected)

def test_categorical_int_to_int32(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize on the any_int_dtype, ideally this also works on any_int_nullable_dtype (but not sure that will just work).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - created a new fixture any_int_or_nullable_int_dtype

@arw2019
Copy link
Contributor Author

arw2019 commented Feb 7, 2021

@jreback addressed test comments here - does work already for all int dtypes, including nullable

@arw2019
Copy link
Contributor Author

arw2019 commented Feb 7, 2021

Doesn't seem to affect basic Categorical benchmarks (but will run some others that I think hit this)

(pandas-dev) andrewwieteska@Andrews-MacBook-Pro asv_bench % asv continuous -f 1.1 upstream/master HEAD -b Categorical before after ratio [87f72bb8] [e8d9650c] <categorical-astype-int32~3^2> <categorical-astype-int32~3> - 1.52±0.1μs 868±20ns 0.57 dtypes.Dtypes.time_pandas_dtype(<class 'pandas.core.dtypes.dtypes.CategoricalDtype'>) SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY. PERFORMANCE INCREASED. 
return request.param


@pytest.fixture(params=tm.ALL_INT_DTYPES + tm.ALL_EA_INT_DTYPES)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i guess this is a missing one

@arw2019
Copy link
Contributor Author

arw2019 commented Feb 7, 2021

Some issues in groupby benchmarks (I don't have experience with how noisy these are - can rerun to see)

 before after ratio [098b9707] [97c981f0] <categorical-astype-int32^2^2> <categorical-astype-int32> + 579±7μs 1.99±0.7ms 3.43 groupby.GroupByMethods.time_dtype_as_group('float', 'rank', 'transformation') + 509±40μs 1.36±0.4ms 2.68 groupby.GroupByMethods.time_dtype_as_group('float', 'mean', 'transformation') + 281±6μs 592±400μs 2.11 groupby.GroupByMethods.time_dtype_as_group('object', 'last', 'transformation') + 482±9μs 999±500μs 2.07 groupby.GroupByMethods.time_dtype_as_group('float', 'prod', 'direct') + 300±10μs 605±100μs 2.02 groupby.GroupByMethods.time_dtype_as_group('object', 'last', 'direct') + 521±7μs 1.03±0.6ms 1.97 groupby.GroupByMethods.time_dtype_as_field('datetime', 'nunique', 'direct') + 468±8μs 854±400μs 1.82 groupby.GroupByMethods.time_dtype_as_group('float', 'median', 'direct') + 567±20μs 1.01±0.3ms 1.78 groupby.GroupByMethods.time_dtype_as_group('float', 'quantile', 'transformation') + 471±6μs 693±200μs 1.47 groupby.GroupByMethods.time_dtype_as_group('object', 'rank', 'transformation') + 251±5μs 329±30μs 1.31 groupby.GroupByMethods.time_dtype_as_group('object', 'bfill', 'direct') - 3.22±0.2s 2.27±0.02s 0.70 groupby.GroupByMethods.time_dtype_as_group('int', 'describe', 'direct') - 458±30μs 308±20μs 0.67 groupby.GroupByMethods.time_dtype_as_group('float', 'head', 'transformation') - 227±30μs 153±6μs 0.67 groupby.GroupByMethods.time_dtype_as_group('int', 'count', 'direct') - 391±70μs 259±4μs 0.66 groupby.GroupByMethods.time_dtype_as_group('datetime', 'head', 'direct') - 567±100μs 351±20μs 0.62 groupby.GroupByMethods.time_dtype_as_group('int', 'first', 'direct') - 280±100μs 167±10μs 0.60 groupby.GroupByMethods.time_dtype_as_group('float', 'any', 'transformation') - 701±100μs 405±30μs 0.58 groupby.GroupByMethods.time_dtype_as_group('int', 'bfill', 'direct') - 1.74±1ms 1.01±0.06ms 0.58 groupby.GroupByMethods.time_dtype_as_group('float', 'value_counts', 'transformation') - 679±200μs 367±40μs 0.54 groupby.GroupByMethods.time_dtype_as_group('int', 'var', 'transformation') - 334±100μs 139±10μs 0.42 groupby.GroupByMethods.time_dtype_as_field('datetime', 'size', 'transformation') SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY. PERFORMANCE DECREASED. 
@arw2019
Copy link
Contributor Author

arw2019 commented Feb 7, 2021

re-running the groupby benchmarks that had been affected:

(pandas-dev) andrewwieteska@Andrews-MacBook-Pro asv_bench % asv continuous -f 1.1 upstream/master HEAD -b groupby.GroupByMethods.time_dtype_as_group before after ratio [098b9707] [97c981f0] <categorical-astype-int32^2^2> <categorical-astype-int32> + 276±10μs 974±300μs 3.53 groupby.GroupByMethods.time_dtype_as_group('float', 'cummax', 'direct') + 309±40μs 1.03±0.1ms 3.33 groupby.GroupByMethods.time_dtype_as_group('float', 'cumsum', 'direct') + 149±5ms 352±200ms 2.36 groupby.GroupByMethods.time_dtype_as_group('datetime', 'unique', 'transformation') + 273±6μs 560±200μs 2.05 groupby.GroupByMethods.time_dtype_as_group('float', 'head', 'transformation') + 1.03±0.04ms 1.91±1ms 1.86 groupby.GroupByMethods.time_dtype_as_group('float', 'value_counts', 'direct') + 321±20μs 556±200μs 1.73 groupby.GroupByMethods.time_dtype_as_group('datetime', 'last', 'transformation') + 308±3μs 491±40μs 1.59 groupby.GroupByMethods.time_dtype_as_group('float', 'first', 'direct') + 454±3μs 703±400μs 1.55 groupby.GroupByMethods.time_dtype_as_group('float', 'prod', 'transformation') + 581±7μs 853±300μs 1.47 groupby.GroupByMethods.time_dtype_as_group('float', 'rank', 'direct') + 250±3μs 351±200μs 1.41 groupby.GroupByMethods.time_dtype_as_group('object', 'ffill', 'direct') + 306±2μs 399±50μs 1.30 groupby.GroupByMethods.time_dtype_as_group('float', 'max', 'transformation') + 457±7μs 533±60μs 1.16 groupby.GroupByMethods.time_dtype_as_group('float', 'prod', 'direct') + 241±2ms 273±10ms 1.13 groupby.GroupByMethods.time_dtype_as_group('float', 'skew', 'direct') - 564±200μs 181±9μs 0.32 groupby.GroupByMethods.time_dtype_as_group('int', 'any', 'direct') SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY. PERFORMANCE DECREASED. 
@arw2019
Copy link
Contributor Author

arw2019 commented Feb 7, 2021

since a different combination seems to be affected on a re-run i'm inclined to think that the perf regressions are (at least largely?) noise

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm am concerned about this led issue

let's push this to 1.2.3 until can look at closer

@jreback jreback modified the milestones: 1.2.2, 1.2.3 Feb 7, 2021
@arw2019
Copy link
Contributor Author

arw2019 commented Feb 7, 2021

hmm am concerned about this led issue

let's push this to 1.2.3 until can look at closer

ok! anything i can do to move this forward (is it worth getting reliable numbers for some of these with timeit for example?)

@jreback
Copy link
Contributor

jreback commented Feb 7, 2021

since categorical is used in groupby this slows down everything

see if it's possible to bypass / special case the conversion (when it doesn't need to be converted); eg it's already int64 - might fix it

@arw2019
Copy link
Contributor Author

arw2019 commented Feb 7, 2021

Benchmarks as of e108f0d (change is to use np.asarray instead of extract_array) are below. The big regressions are (mostly) gone

(pandas-dev) andrewwieteska@Andrews-MacBook-Pro asv_bench % asv continuous -f 1.1 upstream/master HEAD -b groupby.GroupByMethods.time_dtype_as_group before after ratio [098b9707] [97c981f0] <maybe_box_native^2> <categorical-astype-int32> + 682±90μs 1.23±4ms 1.81 groupby.GroupByMethods.time_dtype_as_group('float', 'rank', 'direct') - 353±30μs 314±6μs 0.89 groupby.GroupByMethods.time_dtype_as_group('datetime', 'first', 'direct') - 1.67±0.2ms 1.47±0.02ms 0.88 groupby.GroupByMethods.time_dtype_as_group('float', 'pct_change', 'transformation') - 573±60μs 500±4μs 0.87 groupby.GroupByMethods.time_dtype_as_group('float', 'nunique', 'transformation') - 304±60μs 265±3μs 0.87 groupby.GroupByMethods.time_dtype_as_group('datetime', 'head', 'transformation') - 286±2ms 247±5ms 0.86 groupby.GroupByMethods.time_dtype_as_group('float', 'skew', 'direct') - 164±10ms 141±1ms 0.86 groupby.GroupByMethods.time_dtype_as_group('float', 'unique', 'direct') - 1.76±0.3ms 1.46±0.02ms 0.83 groupby.GroupByMethods.time_dtype_as_group('float', 'pct_change', 'direct') - 113±4ms 92.1±5ms 0.81 groupby.GroupByMethods.time_dtype_as_group('int', 'unique', 'transformation') - 298±10ms 241±4ms 0.81 groupby.GroupByMethods.time_dtype_as_group('float', 'skew', 'transformation') - 902±100μs 718±5μs 0.80 groupby.GroupByMethods.time_dtype_as_group('float', 'sem', 'direct') - 581±70μs 450±4μs 0.77 groupby.GroupByMethods.time_dtype_as_group('float', 'prod', 'transformation') - 405±60μs 313±2μs 0.77 groupby.GroupByMethods.time_dtype_as_group('datetime', 'first', 'transformation') - 367±400μs 259±3μs 0.71 groupby.GroupByMethods.time_dtype_as_group('datetime', 'cummin', 'transformation') - 706±100μs 488±6μs 0.69 groupby.GroupByMethods.time_dtype_as_group('float', 'sum', 'transformation') - 423±20μs 291±4μs 0.69 groupby.GroupByMethods.time_dtype_as_group('float', 'cumsum', 'transformation') - 506±50μs 340±7μs 0.67 groupby.GroupByMethods.time_dtype_as_group('int', 'var', 'transformation') - 428±80μs 283±3μs 0.66 groupby.GroupByMethods.time_dtype_as_group('float', 'cumcount', 'direct') - 228±50μs 132±2μs 0.58 groupby.GroupByMethods.time_dtype_as_group('object', 'size', 'direct') - 1.50±0.7ms 570±10μs 0.38 groupby.GroupByMethods.time_dtype_as_group('datetime', 'quantile', 'transformation') - 1.75±0.3ms 599±20μs 0.34 groupby.GroupByMethods.time_dtype_as_group('datetime', 'rank', 'direct') SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY. PERFORMANCE DECREASED. 
@jreback
Copy link
Contributor

jreback commented Feb 7, 2021

ok great

@jreback jreback modified the milestones: 1.2.3, 1.2.2 Feb 7, 2021
@arw2019 arw2019 closed this Feb 8, 2021
@arw2019 arw2019 reopened this Feb 8, 2021
@jreback jreback merged commit f773083 into pandas-dev:master Feb 8, 2021
@jreback
Copy link
Contributor

jreback commented Feb 8, 2021

thanks @arw2019

@jreback
Copy link
Contributor

jreback commented Feb 8, 2021

@meeseeksdev backport 1.2.x

@lumberbot-app

This comment has been minimized.

@jreback
Copy link
Contributor

jreback commented Feb 8, 2021

cc @simonjayhawkins @arw2019 for manual backport

@simonjayhawkins
Copy link
Member

the conflict is due to test_astype_categorical_invalid_conversions, xref #37730 and #37677. there are two instances of this test on 1.2.x

will leave unchanged.

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Feb 8, 2021
jorisvandenbossche pushed a commit that referenced this pull request Feb 8, 2021
…gument' (#39676) Co-authored-by: Andrew Wieteska <48889395+arw2019@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Categorical Categorical Data Type Regression Functionality that used to work in a prior pandas version

3 participants