| msg358623 - (view) | Author: Inada Naoki (methane) *  | Date: 2019-12-18 12:10 |
Assume you are writing an extension module that reads strings, for example an HTML escaper or a JSON encoder. There are two approaches:

(a) Support the three KINDs of the flexible unicode representation.
(b) Get UTF-8 data from the unicode object.

(a) will be the fastest on CPython, but it has a few drawbacks:

* It is tightly coupled to the CPython implementation, so it will be slow on PyPy.
* CPython may change its internal representation to UTF-8 in the future, like PyPy.
* You cannot easily reuse algorithms written in C that handle `char*`.

So I believe (b) should be the preferred way. But CPython doesn't provide an efficient way to get UTF-8 from a unicode object:

* PyUnicode_AsUTF8AndSize(): when the unicode contains non-ASCII characters, it creates a UTF-8 cache. The cache is retained longer than needed, and there is an extra malloc + memcpy to create it.
* PyUnicode_AsUTF8String(): it creates a bytes object even when the unicode object is ASCII-only or a UTF-8 cache already exists.

For speed and efficiency, I propose a new API:

```
/* Borrow the UTF-8 C string from the unicode object.
 *
 * Stores a pointer to the UTF-8 encoding of the unicode in *utf8* and its size in *size*.
 * The returned object owns the *utf8* data.  You need to Py_DECREF() it after
 * you have finished using *utf8*.  The owner may not be the unicode object itself.
 * Returns NULL if an error occurred while encoding the unicode to UTF-8.
 */
PyObject* PyUnicode_BorrowUTF8(PyObject *unicode, const char **utf8, Py_ssize_t *len);
```

When the unicode object is ASCII or has a UTF-8 cache, this API increments the refcount of the unicode and returns it. Otherwise, it calls `_PyUnicode_AsUTF8String(unicode, NULL)` and returns that.
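A caller-side sketch of how the proposed API might be used. PyUnicode_BorrowUTF8() is only the proposal above, not an existing CPython function, and escape_utf8() is a hypothetical helper working on a plain `char*`:

```
static PyObject *
html_escape(PyObject *self, PyObject *arg)
{
    const char *utf8;
    Py_ssize_t len;

    /* Proposed API: borrow a UTF-8 view of `arg`; `owner` keeps it alive. */
    PyObject *owner = PyUnicode_BorrowUTF8(arg, &utf8, &len);
    if (owner == NULL) {
        return NULL;                      /* encoding error is already set */
    }
    PyObject *result = escape_utf8(utf8, len);  /* hypothetical char* algorithm */
    Py_DECREF(owner);                     /* release the borrowed UTF-8 data */
    return result;
}
```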
| msg358662 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) *  | Date: 2019-12-19 09:43 |
Do you mean some concrete code? I have wished for a similar feature several times: get the UTF-8 cache if it exists, and encode to UTF-8 without creating a cache otherwise. The private _PyUnicode_UTF8() macro could help:

    if ((s = _PyUnicode_UTF8(str))) {
        size = _PyUnicode_UTF8_LENGTH(str);
        tmpbytes = NULL;
    }
    else {
        tmpbytes = _PyUnicode_AsUTF8String(str, "replace");
        s = PyBytes_AS_STRING(tmpbytes);
        size = PyBytes_GET_SIZE(tmpbytes);
    }

but it is not even available outside of unicodeobject.c.

PyUnicode_BorrowUTF8() looks too complex for the public API, and I am not sure it will be easy to implement in PyPy. It also does not cover all use cases -- sometimes you want to convert to UTF-8 without using any memory allocation at all (either use an existing buffer, or raise an error if there is no cached UTF-8 and the string is not ASCII).
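For that last "no allocation at all" case, a rough sketch using only public (non-limited) C API macros might look like the following. It cannot see the private UTF-8 cache, so it only handles the ASCII fast path and refuses everything else; the helper name is made up for illustration:

```
static int
get_ascii_utf8(PyObject *str, const char **utf8, Py_ssize_t *len)
{
    if (PyUnicode_READY(str) < 0) {
        return -1;
    }
    if (PyUnicode_IS_ASCII(str)) {
        /* For an ASCII string, the 1-byte canonical form is already valid UTF-8. */
        *utf8 = (const char *)PyUnicode_1BYTE_DATA(str);
        *len = PyUnicode_GET_LENGTH(str);
        return 0;
    }
    /* No public way to check for a cached UTF-8 representation, so refuse. */
    PyErr_SetString(PyExc_ValueError, "non-ASCII string would need encoding");
    return -1;
}
```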
| msg358663 - (view) | Author: STINNER Victor (vstinner) *  | Date: 2019-12-19 09:46 |
> The returned object owns the *utf8* data.  You need to Py_DECREF() it after
> you have finished using *utf8*.  The owner may not be the unicode object itself.

Would it be possible to use a "container" object like a Py_buffer? Is there a way to customize the code executed when a Py_buffer is "released"?

Py_buffer would be nice since it already has a pointer attribute (data) and a length attribute, and there is an API to "release" a Py_buffer. It can be marked as read-only, etc.
| msg358664 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) *  | Date: 2019-12-19 10:01 |
> Would it be possible to use a "container" object like a Py_buffer?

Looks like a good idea.

    int PyUnicode_GetUTF8Buffer(Py_buffer *view, const char *errors)
| msg358665 - (view) | Author: Inada Naoki (methane) *  | Date: 2019-12-19 10:04 |
> Would it be possible to use a "container" object like a Py_buffer? Is there a way to customize the code executed when a Py_buffer is "released"?

It looks like a nice idea! Py_buffer.obj is decref-ed when the buffer is released.
https://docs.python.org/3/c-api/buffer.html#c.PyBuffer_Release

    int
    PyUnicode_GetUTF8Buffer(PyObject *unicode, Py_buffer *view)
    {
        if (!PyUnicode_Check(unicode)) {
            PyErr_BadArgument();
            return NULL;
        }
        if (PyUnicode_READY(unicode) == -1) {
            return NULL;
        }
        if (PyUnicode_UTF8(unicode) != NULL) {
            return PyBuffer_FillInfo(view, unicode,
                                     PyUnicode_UTF8(unicode),
                                     PyUnicode_UTF8_LENGTH(unicode),
                                     1, PyBUF_CONTIG_RO);
        }
        PyObject *bytes = _PyUnicode_AsUTF8String(unicode, NULL);
        if (bytes == NULL) {
            return NULL;
        }
        return PyBytesType.tp_as_buffer(bytes, view, PyBUF_CONTIG_RO);
    }
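A caller-side sketch of the Py_buffer pattern, assuming the signature sketched above; write_bytes() is a hypothetical consumer, not a real API:

```
static int
write_utf8(PyObject *unicode)
{
    Py_buffer view;
    if (PyUnicode_GetUTF8Buffer(unicode, &view) < 0) {
        return -1;
    }
    /* view.buf/view.len point at the UTF-8 data; view.obj keeps it alive. */
    int rc = write_bytes((const char *)view.buf, view.len);  /* hypothetical sink */
    PyBuffer_Release(&view);   /* decrefs view.obj (the unicode or a temporary bytes) */
    return rc;
}
```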
| msg358666 - (view) | Author: Inada Naoki (methane) *  | Date: 2019-12-19 10:05 |
s/return NULL/return -1/g |
| msg358670 - (view) | Author: STINNER Victor (vstinner) *  | Date: 2019-12-19 10:37 |
>     return PyBytesType.tp_as_buffer(bytes, view, PyBUF_CONTIG_RO);

Don't you need to DECREF bytes somehow, at least in case of failure?
| msg358673 - (view) | Author: Inada Naoki (methane) *  | Date: 2019-12-19 11:20 |
> Don't you need to DECREF bytes somehow, at least in case of failure?

Thanks. I will create a pull request with the suggested changes.
| msg358778 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) *  | Date: 2019-12-21 18:57 |
I like this idea, but I think we should at least notify python-dev about all additions to the public C API. If somebody has objections or a better idea, it is better to know earlier.
| msg358860 - (view) | Author: Inada Naoki (methane) *  | Date: 2019-12-25 07:41 |
> I like this idea, but I think we should at least notify python-dev about all additions to the public C API. If somebody has objections or a better idea, it is better to know earlier.

I created a post about this issue on discuss.python.org:
https://discuss.python.org/t/better-api-for-encoding-unicode-objects-with-utf-8/2909
| msg361284 - (view) | Author: Inada Naoki (methane) *  | Date: 2020-02-03 12:10 |
I am still not sure whether we should add a new API just for avoiding the cache.

* PyUnicode_AsUTF8String: when we need a bytes object, or want to avoid the cache.
* PyUnicode_AsUTF8AndSize: when we need a C string and the cache is acceptable.

With PR 18327, PyUnicode_AsUTF8AndSize becomes 10+% faster than the master branch, and the same speed as PyUnicode_AsUTF8String.

## vs master

$ ./python -m pyperf timeit --compare-to=../cpython/python --python-names master:patched -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "こんにちは")'
master: ..................... 96.6 us +- 3.3 us
patched: ..................... 83.3 us +- 0.3 us

Mean +- std dev: [master] 96.6 us +- 3.3 us -> [patched] 83.3 us +- 0.3 us: 1.16x faster (-14%)

## vs AsUTF8String

$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "こんにちは")'
.....................
Mean +- std dev: 83.2 us +- 0.2 us

$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8string as b' -- 'b(1000, "hello", "こんにちは")'
.....................
Mean +- std dev: 81.9 us +- 2.1 us

## vs AsUTF8String (ASCII)

If we cannot accept the cache, PyUnicode_AsUTF8String is slower than PyUnicode_AsUTF8 when the unicode is an ASCII string. PyUnicode_GetUTF8Buffer helps only in this case.

$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "world")'
.....................
Mean +- std dev: 37.5 us +- 1.7 us

$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8string as b' -- 'b(1000, "hello", "world")'
.....................
Mean +- std dev: 46.4 us +- 1.6 us
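The actual _testcapi helpers are in the attached patch (next message); a rough sketch of what such a benchmark function might look like, matching the `b(1000, "hello", "こんにちは")` call shape above. The name and loop structure here are assumptions, not the patch's actual code:

```
static PyObject *
unicode_bench_asutf8_sketch(PyObject *self, PyObject *args)
{
    Py_ssize_t loops;
    PyObject *s1, *s2;
    if (!PyArg_ParseTuple(args, "nUU", &loops, &s1, &s2)) {
        return NULL;
    }
    /* Encode both strings `loops` times and discard the result. */
    for (Py_ssize_t i = 0; i < loops; i++) {
        Py_ssize_t size;
        if (PyUnicode_AsUTF8AndSize(s1, &size) == NULL ||
            PyUnicode_AsUTF8AndSize(s2, &size) == NULL) {
            return NULL;
        }
    }
    Py_RETURN_NONE;
}
```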
| msg361285 - (view) | Author: Inada Naoki (methane) *  | Date: 2020-02-03 12:14 |
The attached patch is the benchmark function I used in the previous post.
| msg362766 - (view) | Author: Inada Naoki (methane) *  | Date: 2020-02-27 04:49 |
New changeset 02a4d57263a9846de35b0db12763ff9e7326f62c by Inada Naoki in branch 'master':
bpo-39087: Optimize PyUnicode_AsUTF8AndSize() (GH-18327)
https://github.com/python/cpython/commit/02a4d57263a9846de35b0db12763ff9e7326f62c
| msg364141 - (view) | Author: Inada Naoki (methane) *  | Date: 2020-03-14 03:43 |
New changeset c7ad974d341d3edb6b9d2a2dcae4d3d4794ada6b by Inada Naoki in branch 'master':
bpo-39087: Add _PyUnicode_GetUTF8Buffer() (GH-17659)
https://github.com/python/cpython/commit/c7ad974d341d3edb6b9d2a2dcae4d3d4794ada6b
| msg364142 - (view) | Author: Inada Naoki (methane) *  | Date: 2020-03-14 04:24 |
I'm sorry about merging PR 18327, but I cannot find enough usage examples for _PyUnicode_GetUTF8Buffer. PyUnicode_AsUTF8AndSize is now optimized, and the utf8 cache is not so bad in most cases. So _PyUnicode_GetUTF8Buffer does not seem worth it. I will revert PR 18327.
| msg364146 - (view) | Author: Inada Naoki (methane) *  | Date: 2020-03-14 06:59 |
New changeset 3a8c56295d6272ad2177d2de8af4c3f824f3ef92 by Inada Naoki in branch 'master':
Revert "bpo-39087: Add _PyUnicode_GetUTF8Buffer()" (GH-18985)
https://github.com/python/cpython/commit/3a8c56295d6272ad2177d2de8af4c3f824f3ef92
| msg364151 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) *  | Date: 2020-03-14 11:00 |
I thought there were at least 3-4 use cases in the core and stdlib.
| Date | User | Action | Args |
| 2022-04-11 14:59:24 | admin | set | github: 83268 |
| 2020-03-14 11:00:07 | serhiy.storchaka | set | messages: + msg364151 |
| 2020-03-14 06:59:47 | methane | set | status: open -> closed, resolution: fixed, stage: patch review -> resolved |
| 2020-03-14 06:59:31 | methane | set | messages: + msg364146 |
| 2020-03-14 04:27:50 | methane | set | pull_requests: + pull_request18333 |
| 2020-03-14 04:24:17 | methane | set | messages: + msg364142 |
| 2020-03-14 03:45:35 | methane | set | pull_requests: + pull_request18332 |
| 2020-03-14 03:43:26 | methane | set | messages: + msg364141 |
| 2020-02-27 04:49:03 | methane | set | messages: + msg362766 |
| 2020-02-03 12:14:13 | methane | set | files: + bench-asutf8.patch, messages: + msg361285 |
| 2020-02-03 12:10:03 | methane | set | messages: + msg361284 |
| 2020-02-03 10:57:31 | methane | set | pull_requests: + pull_request17701 |
| 2019-12-25 07:41:22 | methane | set | messages: + msg358860 |
| 2019-12-23 11:56:16 | methane | set | pull_requests: + pull_request17140 |
| 2019-12-21 18:57:08 | serhiy.storchaka | set | messages: + msg358778 |
| 2019-12-19 12:36:22 | methane | set | keywords: + patch, stage: patch review, pull_requests: + pull_request17127 |
| 2019-12-19 11:20:38 | methane | set | messages: + msg358673 |
| 2019-12-19 10:37:20 | vstinner | set | messages: + msg358670 |
| 2019-12-19 10:05:12 | methane | set | messages: + msg358666 |
| 2019-12-19 10:04:29 | methane | set | nosy: - skrah, messages: + msg358665 |
| 2019-12-19 10:01:31 | serhiy.storchaka | set | nosy: + skrah, messages: + msg358664 |
| 2019-12-19 09:46:19 | vstinner | set | messages: + msg358663 |
| 2019-12-19 09:43:13 | serhiy.storchaka | set | nosy: + serhiy.storchaka, messages: + msg358662 |
| 2019-12-18 15:32:26 | vstinner | set | title: No efficient API to get UTF-8 string from unicode object. -> [C API] No efficient C API to get UTF-8 string from unicode object. |
| 2019-12-18 15:29:27 | vstinner | set | nosy: + vstinner |
| 2019-12-18 12:10:15 | methane | create | |