Message 142132 - Python tracker


Author	ezio.melotti
Recipients	belopolsky, dangra, ezio.melotti, lemburg, pitrou, sjmachin, spatz123, vstinner
Date	2011-08-15.17:15:30
Message-id	<1313428531.37.0.895024758346.issue8271@psf.upfronthosting.co.za>
In-reply-to

Content
Here are some benchmarks:

Commands:
# half of the bytes are invalid
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "surrogateescape")'
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "replace")'
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "ignore")'

With patch:
1000 loops, best of 3: 854 usec per loop
1000 loops, best of 3: 509 usec per loop
1000 loops, best of 3: 415 usec per loop

Without patch:
1000 loops, best of 3: 670 usec per loop
1000 loops, best of 3: 470 usec per loop
1000 loops, best of 3: 382 usec per loop

Commands (from the interactive interpreter):
# all valid codepoints
import timeit
b = "".join(chr(c) for c in range(0x110000) if c not in range(0xD800, 0xE000)).encode("utf-8")
b_dec = b.decode
timeit.Timer('b_dec("utf-8")', 'from __main__ import b_dec').timeit(100)/100
timeit.Timer('b_dec("utf-8", "surrogateescape")', 'from __main__ import b_dec').timeit(100)/100
timeit.Timer('b_dec("utf-8", "replace")', 'from __main__ import b_dec').timeit(100)/100
timeit.Timer('b_dec("utf-8", "ignore")', 'from __main__ import b_dec').timeit(100)/100

With patch:
0.03830226898193359
0.03849360942840576
0.03835036039352417
0.03821949005126953

Without patch:
0.03750091791152954
0.037977190017700196
0.04067679166793823
0.038579678535461424

Commands:
# near-worst case scenario, 1 byte dropped every 5 from a valid utf-8 string
b2 = bytes(c for k,c in enumerate(b) if k%5)
b2_dec = b2.decode
timeit.Timer('b2_dec("utf-8", "surrogateescape")', 'from __main__ import b2_dec').timeit(10)/10
timeit.Timer('b2_dec("utf-8", "replace")', 'from __main__ import b2_dec').timeit(10)/10
timeit.Timer('b2_dec("utf-8", "ignore")', 'from __main__ import b2_dec').timeit(10)/10

With patch:
9.645482301712036
6.602735090255737
5.338080596923828

Without patch:
8.124328684806823
5.804249691963196
4.851014900207519

All tests done on wide 3.2.
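For readers unfamiliar with the three error handlers timed above, here is a minimal sketch of what each one does to the same half-invalid input. It is not part of the patch; the variable names are illustrative, and the count noted for "replace" assumes a current CPython release, where each invalid byte in this input is reported as a separate error.

```python
# Illustrative sketch only; not taken from the patch under discussion.
# 0x00-0x7F decode as ASCII; 0x80-0xFF are all invalid in this byte order.
data = bytes(range(256))

# Each handler deals with the invalid bytes differently.
escaped = data.decode("utf-8", "surrogateescape")  # invalid byte -> lone surrogate U+DC80..U+DCFF
replaced = data.decode("utf-8", "replace")         # invalid sequence -> U+FFFD
ignored = data.decode("utf-8", "ignore")           # invalid sequence -> dropped

# surrogateescape is lossless: encoding with the same handler restores the bytes.
round_tripped = escaped.encode("utf-8", "surrogateescape")

print(round_tripped == data)       # True
print(replaced.count("\ufffd"))    # 128 on current CPython
print(len(ignored))                # only the ASCII half survives
```

The extra work each handler does per error (building a surrogate, inserting a replacement character, or just skipping) is one plausible reason the three timings differ from each other, though the benchmarks above do not isolate that.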
Since the changes only affect error handling, decoding of valid utf-8 strings is not affected. Decoding invalid strings with non-strict error handlers is slower, but I don't think the difference is significant. If the patch is fine I will commit it.
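To put a number on "slower", the relative overhead can be worked out from the timings reported in this message. The sketch below only does arithmetic on those figures; the dictionary and function names are hypothetical, and a single run per configuration says nothing about measurement noise.

```python
# Timings copied from the benchmark output in this message:
# (with patch, without patch) for each error handler.
half_invalid_usec = {
    "surrogateescape": (854, 670),
    "replace": (509, 470),
    "ignore": (415, 382),
}
near_worst_sec = {
    "surrogateescape": (9.645482301712036, 8.124328684806823),
    "replace": (6.602735090255737, 5.804249691963196),
    "ignore": (5.338080596923828, 4.851014900207519),
}


def overhead_percent(with_patch, without_patch):
    """Return how much slower the patched run is, as a percentage."""
    return (with_patch / without_patch - 1.0) * 100.0


def summarize(timings):
    """Map each handler name to its rounded overhead percentage."""
    return {
        handler: round(overhead_percent(patched, unpatched))
        for handler, (patched, unpatched) in timings.items()
    }


half_invalid_overhead = summarize(half_invalid_usec)
near_worst_overhead = summarize(near_worst_sec)

print(half_invalid_overhead)
print(near_worst_overhead)
```

On these figures the patched decoder is somewhere between about 8% and 27% slower on invalid input, with "surrogateescape" showing the largest difference in both scenarios.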

History
Date	User	Action	Args
2011-08-15 17:15:31	ezio.melotti	set	recipients: + ezio.melotti, lemburg, sjmachin, belopolsky, pitrou, vstinner, dangra, spatz123
2011-08-15 17:15:31	ezio.melotti	set	messageid: <1313428531.37.0.895024758346.issue8271@psf.upfronthosting.co.za>
2011-08-15 17:15:30	ezio.melotti	link	issue8271 messages
2011-08-15 17:15:30	ezio.melotti	create