Skip to content

Conversation

@jbrockmendel
Copy link
Member

This is the first of a bunch of PRs to take the place of #17274.

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
@codecov
Copy link

codecov bot commented Sep 2, 2017

Codecov Report

Merging #17419 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@ Coverage Diff @@ ## master #17419 +/- ## ========================================== - Coverage 91.15% 91.13% -0.02%  ========================================== Files 163 163 Lines 49534 49534 ========================================== - Hits 45153 45144 -9  - Misses 4381 4390 +9
Flag Coverage Δ
#multiple 88.92% <ø> (ø) ⬆️
#single 40.22% <ø> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.72% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e6aed2e...fb55a8c. Read the comment docs.

@gfyoung gfyoung added Clean Internals Related to non-user accessible pandas implementation labels Sep 2, 2017
@sinhrks sinhrks added the Timezones Timezone data dtype label Sep 4, 2017
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls show any asv diffs.

from dateutil.tz import gettz as _dateutil_gettz


from pytz.tzinfo import BaseTzInfo as _pytz_BaseTzInfo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some extra imports here (I know you may use them later), but let's add them later then.

@jbrockmendel
Copy link
Member Author

In order to get asv to run, I had to move timezones.pyx from _libs/tslibs/ to just _libs/.

asv continuous -f 1.1 -E virtualenv master HEAD -b ^timeseries before after ratio [8351f86a] [29669bb5] + 1.09±0ms 1.42±0.06ms 1.30 timeseries.ResampleSeries.time_1min_5min_ohlc + 1.78±0.04ms 1.99±0.01ms 1.11 timeseries.ResampleSeries.time_resample_datetime64 + 2.96±0.01ms 3.26±0.04ms 1.10 timeseries.ToDatetime.time_iso8601_format - 49.0±4μs 43.5±0.2μs 0.89 timeseries.SemiMonthOffset.time_end_apply - 2.55±0.01ms 2.23±0.04ms 0.87 timeseries.ResampleDataFrame.time_max_numpy - 6.62±0.05μs 5.61±0μs 0.85 timeseries.DatetimeIndex.time_timestamp_tzinfo_cons - 28.3±0.2μs 23.9±0.09μs 0.84 timeseries.Offsets.time_custom_bday_cal_incr - 17.9±1μs 14.6±0.05μs 0.82 timeseries.Offsets.time_custom_bday_apply - 19.9±0.8ms 13.5±0.4ms 0.68 timeseries.AsOfDataFrame.time_asof_single - 21.4±2ms 13.8±0.7ms 0.64 timeseries.AsOfDataFrame.time_asof_nan_single - 1.98s 337±7ms 0.17 timeseries.AsOfDataFrame.time_asof - 1.98s 326±5ms 0.16 timeseries.AsOfDataFrame.time_asof_nan SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY. 

As usual, I don't see anything meaningful here, since the one function being defined in timezones.pyx is verbatim identical to the one it is replacing.

@jreback
Copy link
Contributor

jreback commented Sep 5, 2017

we'll did your code affect asof?

that is a non trivial diff

@jreback
Copy link
Contributor

jreback commented Sep 5, 2017

you can test multiple things via a regex in the bench

timezone is also relevant here

@jbrockmendel
Copy link
Member Author

we'll did your code affect asof?

I don't see anything in asof that depends directly on timezones, and the AsOf and AsOfDataFrame benchmarks use tz-naive indexes. So I have trouble imagining how that would be affected.

that is a non trivial diff

One function is implemented in timezones.pyx:

cdef inline bint _is_utc(object tz): return tz is UTC or isinstance(tz, _dateutil_tzutc) 

This replaces identical (up to a different alias for _dateutil_tzutc) functions in tslib and _libs.index. The rest is boilerplate.

@jbrockmendel
Copy link
Member Author

timezone is also relevant here

A search for tz|timezone in the benchmarks brought up the timeseries.py benchmarks above, 2 hits in binary_ops.py, and 1 hit in groupby.py. Are there others I'm missing?

asv continuous -f 1.1 -E virtualenv master HEAD -b ^binary_ops before after ratio [8351f86a] [29669bb5] + 120±10ms 198±30ms 1.65 binary_ops.Ops.time_frame_multi_and(False, 'default') + 142±8ms 221±30ms 1.56 binary_ops.Ops.time_frame_multi_and(True, 1) + 88.4±2ms 132±20ms 1.49 binary_ops.Ops2.time_frame_float_floor_by_zero + 85.0±1ms 126±5ms 1.49 binary_ops.Ops2.time_frame_float_div + 136±5ms 186±20ms 1.36 binary_ops.Ops.time_frame_multi_and(False, 1) - 1.08±0.05ms 922±10μs 0.85 binary_ops.Timeseries.time_series_timestamp_compare 

Of these, binary_ops.Timeseries.time_series_timestamp_compare is the only one with tz/tzinfo involved.

asv continuous -f 1.1 -E virtualenv master HEAD -b ^groupby.groupby_datetimetz [...] BENCHMARKS NOT SIGNIFICANTLY CHANGED. 
@jreback
Copy link
Contributor

jreback commented Sep 6, 2017

since this code touches virtually all time related code, pls run with -b time to see.

@jbrockmendel
Copy link
Member Author

OK, I just started this running.

For my own edification: I was under the impression that since this PR is moving a function but not changing it that it would obviously not affect perf. Is there some theoretical reason that is wrong? Is this something cython-specific?

@jreback
Copy link
Contributor

jreback commented Sep 6, 2017

For my own edification: I was under the impression that since this PR is moving a function but not changing it that it would obviously not affect perf. Is there some theoretical reason that is wrong? Is this something cython-specific?

whenever you touch cython in a non-trivial way, we need to assure perf is comparable. in this case you are moving lots of code it is very easy to drop things. just as you always tests, always run perf.

@jbrockmendel
Copy link
Member Author

pls run with -b time to see.

This is running, but it looks like the "time" regex is catching everything since all the bench methods start with "time_".

@jbrockmendel
Copy link
Member Author

 before after ratio [8351f86a] [29669bb5] ! 15.6s failed n/a gil.nogil_datetime_fields.time_datetime_field_day ! 13.2s failed n/a gil.nogil_datetime_fields.time_datetime_field_daysinmonth ! 29.7s failed n/a gil.nogil_datetime_fields.time_datetime_field_normalize ! 15.0s failed n/a gil.nogil_datetime_fields.time_datetime_field_year ! 38.2s failed n/a gil.nogil_datetime_fields.time_period_to_datetime ! 36.5s failed n/a panel_methods.PanelMethods.time_pct_change_items + 49.3±1μs 1.09±0.03s 22102.18 indexing.Int64Indexing.time_loc_scalar + 407±7μs 1.17±0.02s 2866.16 indexing.Int64Indexing.time_ix_list_like + 361±10μs 877±9ms 2427.51 indexing.Int64Indexing.time_getitem_list_like + 557±9μs 1.15±0.05s 2064.80 indexing.Int64Indexing.time_loc_list_like + 1.34±0.04ms 1.08±0.02s 807.81 indexing.Int64Indexing.time_ix_array + 1.68±0.02ms 1.04±0.02s 619.33 indexing.Int64Indexing.time_loc_array + 222±5μs 97.8±30ms 440.73 indexing.MultiIndexing.time_series_xs_mi_ix + 3.44±0.2ms 966±10ms 280.50 indexing.Int64Indexing.time_getitem_lists + 18.1±70ms 3.54s 196.26 join_merge.ConcatFrames.time_c_ordered_axis0 + 24.8±70ms 3.70s 149.36 join_merge.ConcatFrames.time_f_ordered_axis0 + 1.94±0.2ms 230±10ms 118.68 categoricals.Categoricals.time_concat + 267±10μs 5.14±0.7ms 19.23 indexing.MultiIndexing.time_frame_xs_mi_ix + 13.0±1ms 234±100ms 17.96 binary_ops.TimeseriesTZ.time_timestamp_ops_diff1 + 284±10ms 4.04s 14.23 gil.NoGilGroupby.time_sum_8 + 9.09±2ms 102±20ms 11.25 algorithms.Algorithms.time_factorize_float + 6.49±2ms 56.8±10ms 8.76 algorithms.Algorithms.time_factorize_int + 122±20ms 945ms 7.72 frame_methods.Shift.time_shift_axis0 + 129±10ms 829±30ms 6.44 frame_methods.Shift.time_shift_axis_1 + 259±20ms 1.65s 6.37 frame_methods.Reindex.time_reindex_both_axes + 1.41±0.02s 8.55s 6.07 gil.NoGilGroupby.time_groups_2 + 247±2ms 1.49±0.09s 6.05 frame_methods.Reindex.time_reindex_both_axes_ix + 1.08s 6.47s 6.00 join_merge.ConcatPanels.time_c_ordered_axis0 + 400±6ms 2.35s 5.86 indexing.MultiIndexing.time_multiindex_large_get_loc + 74.8±1ms 436±20ms 5.83 index_object.SetOperations.time_int64_union + 187±20ms 1.02s 5.48 index_object.Multi1.time_duplicated + 1.19s 6.35s 5.33 join_merge.ConcatPanels.time_f_ordered_axis0 + 1.13s 5.86s 5.17 join_merge.ConcatPanels.time_f_ordered_axis1 + 172±9ms 887±0.1ms 5.16 indexing.Int64Indexing.time_getitem_array + 666±200ms 3.11s 4.66 gil.NoGilGroupby.time_sum_8_notp + 86.6±20ms 398±10ms 4.60 index_object.SetOperations.time_int64_intersection + 1.81s 8.29s 4.57 join_merge.ConcatPanels.time_c_ordered_axis2 + 535±40ms 2.39s 4.47 indexing.MultiIndexing.time_multiindex_large_get_loc_warm + 174±2ms 775±10ms 4.45 join_merge.Align.time_series_align_left_monotonic + 95.2±1ms 416±0.09ms 4.37 inference.to_numeric_downcast.time_downcast('datetime64', 'unsigned') + 240±5ms 1.03±0.01s 4.31 join_merge.Align.time_series_align_int64_index + 713±9ms 3.04s 4.26 index_object.Multi2.time_sortlevel_int64 + 124±2ms 516±20ms 4.18 panel_ctor.Constructors1.time_panel_from_dict_all_different_indexes + 113±1ms 458±20ms 4.06 panel_ctor.Constructors4.time_panel_from_dict_two_different_indexes + 3.08±0.03s 12.4s 4.03 gil.NoGilGroupby.time_groups_4 + 1.12s 4.45s 3.96 join_merge.ConcatPanels.time_c_ordered_axis1 + 213±20ms 835±20ms 3.92 frame_methods.frame_duplicated.time_frame_duplicated + 7.47s 25.9s 3.47 gil.NoGilGroupby.time_groups_8 + 322±10ms 1.11±0.03s 3.46 index_object.SetOperations.time_int64_symmetric_difference + 174±6ms 597±20ms 3.43 index_object.SetOperations.time_int64_difference + 370±2ms 1.10±0.01s 2.96 indexing.StringIndexing.time_get_value + 187±0.4ms 523±2ms 2.80 inference.to_numeric_downcast.time_downcast('string-nint', None) + 13.7s 37.6s 2.76 gil.nogil_datetime_fields.time_datetime_to_period + 149±0.5ms 392±8ms 2.63 inference.to_numeric_downcast.time_downcast('string-float', 'integer') + 152±2ms 398±10ms 2.63 inference.to_numeric_downcast.time_downcast('string-float', 'signed') + 205±20ms 514±10ms 2.50 inference.to_numeric_downcast.time_downcast('string-int', None) + 158±0.8ms 362±10ms 2.30 inference.to_numeric_downcast.time_downcast('string-float', 'unsigned') + 288±0.2ms 620±40ms 2.16 inference.to_numeric_downcast.time_downcast('string-int', 'signed') + 263±0.03ms 552±20ms 2.10 inference.to_numeric_downcast.time_downcast('string-int', 'integer') + 11.4±0.3ms 23.4±5ms 2.06 gil.nogil_read_csv.time_read_csv_object + 242±0.9ms 470±10ms 1.94 inference.to_numeric_downcast.time_downcast('string-int', 'float') + 755ms 1.44s 1.91 join_merge.ConcatPanels.time_f_ordered_axis2 + 4.00±0.2ms 6.96±0.8ms 1.74 algorithms.Algorithms.time_duplicated_int + 603ms 958ms 1.59 frame_methods.frame_nunique.time_frame_nunique + 1.45s 2.29s 1.58 eval.Eval.time_mult('python', 'all') + 137±7ms 215ms 1.58 gil.NoGilGroupby.time_sum_4 + 145±0.8μs 224±50μs 1.55 period.period_standard_indexing.time_intersection + 49.4±0.1μs 75.7±4μs 1.53 indexing.Int64Indexing.time_loc_slice + 54.9±0.6μs 82.0±0.5μs 1.49 indexing.Int64Indexing.time_ix_slice + 31.1±0.03ms 46.1±50ms 1.48 panel_ctor.Constructors3.time_panel_from_dict_same_index + 473±7ms 680±20ms 1.44 gil.nogil_take1d_float64.time_nogil_take1d_float64 + 312±3ms 441±3ms 1.41 timeseries.AsOfDataFrame.time_asof_nan + 5.49±0.03ms 7.77±0.6ms 1.41 io_bench.frame_to_csv_date_formatting.time_frame_to_csv_date_formatting + 558±1μs 780±0.7μs 1.40 groupby.GroupBySuite.time_head('int', 100) + 91.7ms 127ms 1.39 packers.JSON.time_write_json_mixed_float_int + 33.3±0.3ms 46.1±2ms 1.38 eval.Eval.time_and('numexpr', 'all') + 320±10ms 442±5ms 1.38 timeseries.AsOfDataFrame.time_asof + 23.0±0.2ms 31.5±0.04ms 1.37 parser_vb.read_csv2.time_comment + 66.5±0.1ms 90.6±0.3ms 1.36 packers.STATA.time_write_stata_with_validation + 82.8±0.3ms 109±7ms 1.32 io_bench.frame_to_csv_mixed.time_frame_to_csv_mixed + 23.9±0.08μs 30.9±0.5μs 1.29 timeseries.Offsets.time_custom_bday_cal_incr_n + 1.21s 1.53s 1.27 join_merge.i8merge.time_i8merge + 65.1±2ms 80.7±1ms 1.24 parser_vb.read_csv_categorical.time_convert_direct + 5.52±0.2ms 6.73±0.3ms 1.22 algorithms.Algorithms.time_duplicated_float + 29.0s 35.3s 1.22 panel_methods.PanelMethods.time_pct_change_major + 1.23s 1.47s 1.19 groupby.Groups.time_groupby_groups('object_large') + 286±0.2ms 338±0.4ms 1.18 inference.to_numeric.time_from_numeric_str + 1.14±0.03ms 1.33±0.01ms 1.16 index_object.Float64.time_mul + 1.51s 1.74s 1.15 timeseries.Iteration.time_iter_periodindex + 1.10±0.01s 1.27s 1.15 reshape.reshape_unstack_large_single_dtype.time_unstack_with_mask + 28.4±0.2μs 32.6±0.08μs 1.15 timeseries.Offsets.time_custom_bday_cal_incr_neg_n + 875±0.4μs 1.00±0.1ms 1.15 period.period_standard_indexing.time_align + 292±2μs 333±7μs 1.14 inference.DtypeInfer.time_uint32 + 26.0s 29.4s 1.13 panel_methods.PanelMethods.time_pct_change_minor + 23.7±0.9ms 26.8±0.3ms 1.13 packers.packers_read_hdf_store.time_packers_read_hdf_store + 7.31s 8.24s 1.13 join_merge.JoinIndex.time_left_outer_join_index + 86.6ms 96.9ms 1.12 packers.JSON.time_write_json_mixed_delta_int_tstamp + 10.1±0.1ms 11.3±0.09ms 1.12 algorithms.Algorithms.time_add_overflow_both_arg_nan + 457±0.5ms 508±3ms 1.11 timeseries.ToDatetime.time_format_no_exact + 240ms 266ms 1.11 gil.nogil_read_csv.time_read_csv_datetime - 17.1±0.2ms 15.5±0.2ms 0.91 categoricals.Categoricals2.time_value_counts - 3.95±0.05ms 3.58±0.08ms 0.91 frame_methods.frame_fillna_many_columns_pad.time_frame_fillna_many_columns_pad - 3.58±0.1ms 3.22±0.07ms 0.90 timeseries.AsOf.time_asof - 1.53±0.03ms 1.37±0ms 0.89 frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('MonthBegin', 2) - 132±3ms 118±0.6ms 0.89 strings.StringMethods.time_get_dummies - 2.67±0.2ms 2.36±0ms 0.89 groupby.GroupBySuite.time_median('int', 10000) - 2.18s 1.93s 0.88 frame_methods.Dropna.time_dropna_axis1_all - 20.1±0.1μs 17.8±0.2μs 0.88 timeseries.Offsets.time_custom_bday_incr - 32.7±0.2μs 28.4±0.1μs 0.87 timeseries.Offsets.time_custom_bday_cal_decr - 459±30ms 398±0.6ms 0.87 timeseries.DatetimeIndex.time_add_offset_slow - 17.2±0.4μs 14.5±0.06μs 0.84 timeseries.Offsets.time_custom_bday_apply - 3.08±0.2ms 2.55±0.01ms 0.83 strings.StringMethods.time_title - 29.8±0.1μs 24.1±0.1μs 0.81 timeseries.Offsets.time_custom_bday_cal_incr - 356ms 271ms 0.76 gil.nogil_read_csv.time_read_csv - 104±0.4ms 73.1±0.7ms 0.70 series_methods.series_isin_int64.time_series_isin_int64_large - 109ms 74.0ms 0.68 packers.JSON.time_write_json_date_index - 1.63±0.03ms 1.05±0.03ms 0.65 binary_ops.Timeseries.time_series_timestamp_compare - 1.75±0.03ms 1.11±0.02ms 0.63 binary_ops.TimeseriesTZ.time_series_timestamp_compare - 254±100ms 159ms 0.63 gil.NoGilGroupby.time_sum_4_notp - 1.73±0.01ms 1.06±0.02ms 0.61 binary_ops.Timeseries.time_timestamp_series_compare - 275ms 163ms 0.59 frame_methods.Dropna.time_dropna_axis0_any - 494±5ms 243±3ms 0.49 inference.to_numeric_downcast.time_downcast('string-nint', 'float') - 474±3ms 220±2ms 0.46 inference.to_numeric_downcast.time_downcast('string-nint', 'unsigned') - 276±5ms 116±2ms 0.42 reshape.reshape_pivot_time_series.time_reshape_pivot_time_series - 1.22s 478ms 0.39 frame_methods.Dropna.time_dropna_axis0_any_mixed_dtypes - 448ms 100ms 0.22 packers.JSON.time_write_json_mixed_float_int_T - 312ms 62.3ms 0.20 frame_methods.Dropna.time_dropna_axis1_any - 257±30ms 49.8±2ms 0.19 categoricals.Categoricals.time_union - 386ms 74.7ms 0.19 packers.JSON.time_write_json_mixed_float_int_str SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY. 
@jreback
Copy link
Contributor

jreback commented Sep 7, 2017

these benches look really unstable. not sure why. These benches should be very stable unless you have something odd going on.

@jreback
Copy link
Contributor

jreback commented Sep 7, 2017

you can try setting processor afinity so it runs on a single (same) processor, I recall you have a 'big' machine?

@jbrockmendel
Copy link
Member Author

Affinity is worth a shot.

taskset 03 asv continuous -f 1.1 -E virtualenv master HEAD -b timeseries -b tz -b timezone [...] before after ratio [8351f86a] [29669bb5] + 5.46±0.02ms 6.28±0.1ms 1.15 timeseries.DatetimeIndex.time_dti_tz_factorize + 29.9±0.07μs 33.3±0.3μs 1.11 timeseries.Offsets.time_custom_bday_cal_decr - 31.0±0.05μs 28.1±0.1μs 0.91 timeseries.Offsets.time_custom_bday_decr - 438±2ms 397±1ms 0.91 timeseries.SemiMonthOffset.time_end_incr_rng - 448±20ms 404±0.1ms 0.90 timeseries.DatetimeIndex.time_add_offset_slow - 22.5±0.06μs 19.9±0.1μs 0.88 timeseries.AsOf.time_asof_single SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY. 
@jbrockmendel
Copy link
Member Author

Tinkering around with setup.py on an unrelated branch I found that .pyx files like 'pandas/_libs/tslibs/timezones' don't need to go into the depends section of the ext_data. This raises three questions:

  1. Is this observation accurate? Is it platform or version specific?

  2. If so, should I remove the extraneous entries from the ext_data in this PR?

  3. If so, should I make a separate PR removing other extraneous entries currently in ext_data?

topper-123 and others added 8 commits September 7, 2017 19:49
* implemented fix for GH issue pandas-dev#16953 * added tests for fix of issue pandas-dev#16953 * changed comments for git issue to pandas style GH# * changed linelength in tests, so all lines are less than 80 characters * added whatsnew entry * swaped conversion and filtering of values, for plot to also work with object dtypes * refomated code, so len(line) < 80 * changed whatsnew with timedelta and datetime dtypes * added support for datetimetz and extended tests * added reason to pytest.mark.xfail
@jreback
Copy link
Contributor

jreback commented Sep 8, 2017

If so, should I remove the extraneous entries from the ext_data in this PR?

not sure what you think is extraneous. these a represent c dependencies. sure some of tseries_depends / lib_depends might be able to be simplied a bit. new PR.

needs a rebase

@jbrockmendel
Copy link
Member Author

not sure what you think is extraneous. these a represent c dependencies. sure some of tseries_depends / lib_depends might be able to be simplied a bit. new PR.

New PR sounds good. This will be easier to just show than describe.

@jreback
Copy link
Contributor

jreback commented Sep 10, 2017

rebase this and can merge next

setup.py Outdated
+ _pxi_dep['hashtable'])},
'_libs.tslib': {'pyxfile': '_libs/tslib',
'pxdfiles': ['_libs/src/util'],
'pxdfiles': ['_libs/src/util', '_libs/lib'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you add this back? ( I agree it looks like it should be here, but what was failing to indicate that)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a screwup I made while rebasing. Will remove momentarily.

@jreback jreback added this to the 0.21.0 milestone Sep 11, 2017
@jreback
Copy link
Contributor

jreback commented Sep 11, 2017

ok lgtm. ping on green.

@jbrockmendel
Copy link
Member Author

ping

@jreback jreback merged commit 46856c3 into pandas-dev:master Sep 11, 2017
@jreback
Copy link
Contributor

jreback commented Sep 11, 2017

thanks.

a tip for you. squashing commits makes rebasing much easier. generally commits should follow logical flow of code and be purposeful.

jreback pushed a commit that referenced this pull request Sep 12, 2017
@jbrockmendel jbrockmendel deleted the tslibs-timezones2 branch October 30, 2017 16:24
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Clean Internals Related to non-user accessible pandas implementation Timezones Timezone data dtype