PEP 393 – Flexible String Representation changed the Unicode implementation in Python 3.3 to use 3 string "kinds":
PyUnicode_KIND_1BYTE (UCS-1): ASCII and Latin1, [U+0000; U+00ff] range.
PyUnicode_KIND_2BYTE (UCS-2): BMP, [U+0000; U+ffff] range.
PyUnicode_KIND_4BYTE (UCS-4): Full Unicode Character Set, [U+0000; U+10ffff] range.
Strings must always use the optimal storage: an ASCII string must be stored as PyUnicode_KIND_1BYTE.
Strings have a flag indicating if the string only contains ASCII characters: [U+0000; U+007f] range. It's used by multiple internal optimizations.
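To make the "optimal storage" rule concrete, here is a minimal sketch in plain C that picks the narrowest character width for a sequence of code points and computes the ASCII flag. The helper name min_kind_for() is hypothetical and is not a CPython API; it only mirrors the ranges listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a CPython API: return the number of bytes per
   character (1, 2 or 4) of the narrowest PEP 393 kind that can hold every
   code point in the buffer, and set *is_ascii to 1 if all code points are
   in the [U+0000; U+007f] range. */
static int min_kind_for(const uint32_t *code_points, size_t len, int *is_ascii)
{
    uint32_t max_char = 0;
    for (size_t i = 0; i < len; i++) {
        /* Code points above U+10ffff are outside the Unicode range. */
        assert(code_points[i] <= 0x10ffff);
        if (code_points[i] > max_char) {
            max_char = code_points[i];
        }
    }
    *is_ascii = (max_char <= 0x7f);
    if (max_char <= 0xff) {
        return 1;   /* UCS-1 */
    }
    if (max_char <= 0xffff) {
        return 2;   /* UCS-2 */
    }
    return 4;       /* UCS-4 */
}
```

An empty string falls into the 1-byte kind with the ASCII flag set, since no character exceeds U+007f.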
This implementation is not exposed in the limited C API. For example, the PyUnicode_FromKindAndData() function is excluded from the stable ABI. In other words, it's not possible to write efficient code for PEP 393 using the limited C API.
I propose adding two functions:
PyUnicode_AsNativeFormat(): export to the native format
PyUnicode_FromNativeFormat(): import from the native format
These functions are added to the limited C API version 3.14.
Native formats (new constants):
PyUnicode_NATIVE_ASCII: ASCII string.
PyUnicode_NATIVE_UCS1: UCS-1 string.
PyUnicode_NATIVE_UCS2: UCS-2 string.
PyUnicode_NATIVE_UCS4: UCS-4 string.
PyUnicode_NATIVE_UTF8: UTF-8 string (CPython implementation detail: only supported for import, not used by export).
Differences with PyUnicode_FromKindAndData():
PyUnicode_NATIVE_ASCII format allows further optimizations.
PyUnicode_NATIVE_UTF8 can be used by PyPy and other Python implementations using UTF-8 as the internal storage.
API:
#define PyUnicode_NATIVE_ASCII 1
#define PyUnicode_NATIVE_UCS1 2
#define PyUnicode_NATIVE_UCS2 3
#define PyUnicode_NATIVE_UCS4 4
#define PyUnicode_NATIVE_UTF8 5

// Get the content of a string in its native format.
// - Return the content, set '*size' and '*native_format' on success.
// - Set an exception and return NULL on error.
PyAPI_FUNC(const void*) PyUnicode_AsNativeFormat(
    PyObject *unicode,
    Py_ssize_t *size,
    int *native_format);

// Create a string object from a native format string.
// - Return a reference to a new string object on success.
// - Set an exception and return NULL on error.
PyAPI_FUNC(PyObject*) PyUnicode_FromNativeFormat(
    const void *data,
    Py_ssize_t size,
    int native_format);
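As an illustration of how a consumer might interpret an exported buffer, here is a minimal sketch in plain C that reads one code point from data according to native_format. It does not call the proposed functions; the constants are copied from the API listing above, and the helper native_read() is hypothetical. It assumes the index counts characters for the fixed-width formats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants copied from the proposal. */
#define PyUnicode_NATIVE_ASCII 1
#define PyUnicode_NATIVE_UCS1 2
#define PyUnicode_NATIVE_UCS2 3
#define PyUnicode_NATIVE_UCS4 4
#define PyUnicode_NATIVE_UTF8 5

/* Hypothetical helper: return the code point at 'index' of a buffer stored
   in 'native_format'. Return -1 for UTF-8, which is variable-width and has
   no constant-time indexing, and for unknown formats. */
static int32_t native_read(const void *data, int native_format, size_t index)
{
    assert(data != NULL);
    switch (native_format) {
    case PyUnicode_NATIVE_ASCII:
    case PyUnicode_NATIVE_UCS1:
        return ((const uint8_t *)data)[index];
    case PyUnicode_NATIVE_UCS2:
        return ((const uint16_t *)data)[index];
    case PyUnicode_NATIVE_UCS4:
        return (int32_t)((const uint32_t *)data)[index];
    default:
        return -1;
    }
}
```

ASCII and UCS-1 share the same one-byte layout, so a reader can treat them alike; the separate ASCII constant only matters for code that wants to skip non-ASCII handling.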
See the attached pull request for more details.
This feature was requested to help port the MarkupSafe C extension to the limited C API. Currently, each release requires building around 60 wheel files, which takes 20 minutes: https://pypi.org/project/MarkupSafe/#files
Using the stable ABI would reduce the number of wheel packages and so ease their release process.
See src/markupsafe/_speedups.c: string functions specialized for the 3 string kinds (UCS-1, UCS-2, UCS-4).
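The pattern of specializing a string function per kind can be sketched as follows. This is not the actual code from _speedups.c: the function names are hypothetical, and the replacement lengths assume the entities &amp;amp;, &amp;lt;, &amp;gt;, &amp;#34; and &amp;#39;. One macro generates a counting loop per character width, and a small dispatcher selects the right one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generate one loop per character type. Each loop returns how many extra
   characters HTML escaping would add: 4 for & " ' and 3 for < >. */
#define DEFINE_ESCAPED_DELTA(NAME, CHAR_T)                  \
    static size_t NAME(const CHAR_T *s, size_t len)         \
    {                                                       \
        size_t delta = 0;                                   \
        for (size_t i = 0; i < len; i++) {                  \
            switch (s[i]) {                                 \
            case '&': case '"': case '\'':                  \
                delta += 4;                                 \
                break;                                      \
            case '<': case '>':                             \
                delta += 3;                                 \
                break;                                      \
            default:                                        \
                break;                                      \
            }                                               \
        }                                                   \
        return delta;                                       \
    }

DEFINE_ESCAPED_DELTA(escaped_delta_1byte, uint8_t)
DEFINE_ESCAPED_DELTA(escaped_delta_2byte, uint16_t)
DEFINE_ESCAPED_DELTA(escaped_delta_4byte, uint32_t)

/* Hypothetical dispatcher: 'char_width' is the number of bytes per
   character (1, 2 or 4) and 'len' is the number of characters. */
static size_t escaped_delta(const void *data, size_t len, int char_width)
{
    switch (char_width) {
    case 1:
        return escaped_delta_1byte((const uint8_t *)data, len);
    case 2:
        return escaped_delta_2byte((const uint16_t *)data, len);
    default:
        assert(char_width == 4);
        return escaped_delta_4byte((const uint32_t *)data, len);
    }
}
```

Each specialized loop works on a fixed character type, which is the kind of code that needs direct access to the string storage.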