Skip to content

Conversation

@rohanjain101
Copy link
Contributor

@rohanjain101 rohanjain101 commented Mar 14, 2024

>>> a = pd.Timedelta(1152921504609987375) >>> a.to_pytimedelta() datetime.timedelta(days=13343, seconds=86304, microseconds=609988) >>> 

609988 for microseconds but expected is 609987

>>> a = pd.Timedelta(1152921504609987374) >>> a.to_pytimedelta() datetime.timedelta(days=13343, seconds=86304, microseconds=609987) >>> 

In this example, a difference of 1 nanosecond, shouldn't result in a difference of 1 microsecond when converted from ns to us unit.

@rohanjain101 rohanjain101 marked this pull request as ready for review March 14, 2024 21:49
Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @rohanjain101

I think both the title of the PR and the description are misleading. You are not really improving the accuracy of to_pytimedelta, and an increase in a nanosecond doesn't cause an increase in a microsecond when calling to_pytimedelta. You simply found the point where the rounding change to the next nanosecond.

I do agree that the way the function works is misleading. The problem is that most people would expect this code:

>>> datetime.timedelta(microseconds=12.5) datetime.timedelta(microseconds=12)

to round up to 13 instead to 12. For good or for bad, this is how Python rounding works:

>>> round(12.5) 12

The round function or the same algorithm is probably used inside of datetime.timedelta, so we get the behavior your experienced when using pandas to_pytimedelta.

The reasoning for how round works is complex, but it's the default rounding strategy for a reason. I'm not sure if in this particular case the rounding strategy you propose could be better. If you think there is a reason to change it, please let us know the reason in the PR description, or maybe better, open an issue to discuss.

I assume from the information provided that you just found Python's rounding strategy misleading, so I'll close this issue, but if there is a better reason than that for this PR, please feel free to add a comment and reopen.

@datapythonista datapythonista added the Timedelta Timedelta data type label Mar 14, 2024
@rohanjain101
Copy link
Contributor Author

rohanjain101 commented Mar 14, 2024

Ok, but the reason this behavior is occurring I don't believe is related to python's round function. It's because python's integer division goes through a floating point which introduces accuracy issues for large integers (and in this case, result is rounded anyway). The issue is with the division, not the rounding done by timedelta.

@datapythonista
Copy link
Member

Sorry, you are correct. I didn't realize the problem with the floating point precision.

I wonder what's the impact on performance of this change, did you execute a benchmark to compare the execution times of both implementations?

@MarcoGorelli @jbrockmendel thoughts on this change?

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a whatsnew note in v3.0.0.rst? I think this change looks OK.

(FWIW, Timestamp.to_pydatetime also just seems to truncate the nanoseconds component. It would be nice to match that behavior but that would be an API change)

@rohanjain101
Copy link
Contributor Author

Added whatsnew, are there existing benchmarks for comparing performance?

@rohanjain101 rohanjain101 requested a review from mroeschke March 25, 2024 12:24
@mroeschke mroeschke added this to the 3.0 milestone Mar 25, 2024
@mroeschke mroeschke merged commit cf40e56 into pandas-dev:main Mar 25, 2024
@mroeschke
Copy link
Member

Thanks @rohanjain101

@rohanjain101 rohanjain101 deleted the to_td branch March 25, 2024 23:04
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
* improve accuracy of to_pytimedelta * f * f * whatsnew * f --------- Co-authored-by: Rohan Jain <rohanjain@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Timedelta Timedelta data type

3 participants