> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org, casouri@gmail.com
> Date: Wed, 23 Nov 2022 09:57:38 -0500
>
> >> But my question was not so much pointing out a problem but trying to
> >> understand why we chose the more complex code.
> > Because we need to compare with byte positions,
>
> Ah, because we wrote "(in bytes)" in the docstring of
> `treesit-max-buffer-size`.  That's a rather unusual choice.  All other
> places where we use(d) a limit on the buffer size it's always been based
> on the number of chars.

No, not because we wrote "in bytes", but because treesit.c
consistently uses byte counts to make similar tests (with a single
exception that I fixed yesterday), and keeps track of byte positions
in its data structures.  I assumed Yuan Fu did that for a reason, and
I see at least a hint in the signature of this function, through which
tree-sitter reads buffer text:

  static const char*
  treesit_read_buffer (void *parser, uint32_t byte_index,
                       TSPoint position, uint32_t *bytes_read)

which uses byte_index and bytes_read, each of which is an unsigned
32-bit value.  And since our hard limit is 4G _bytes_, it didn't seem
consistent to me to test smaller limits against character counts
rather than byte counts.

> I doubt it would make a significant difference here either (e.g. not
> only the "10 times" memory use of the tree-sitter tree is obviously
> a rough approximation, but I doubt it's related to the number of bytes
> more than to the number of chars or even the number of lexemes).

If someone looks in the tree-sitter source code and tells us that we
can compare with character counts instead, I'll be the first to agree.
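To illustrate why the two kinds of count are not interchangeable, here is a minimal standalone sketch in C.  It is not code from treesit.c: the helper names utf8_char_count and fits_in_uint32_offsets are hypothetical, and it assumes plain UTF-8 text, whereas Emacs's internal encoding is a superset of UTF-8.  It only shows that multibyte text has more bytes than characters, and that a uint32_t offset can address at most UINT32_MAX bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count the characters in the NUL-terminated UTF-8 string S.
   Every byte that is not a continuation byte (bit pattern 10xxxxxx)
   starts a new character.  Hypothetical helper, not from Emacs.  */
static size_t
utf8_char_count (const char *s)
{
  size_t chars = 0;
  for (; *s != '\0'; s++)
    if (((unsigned char) *s & 0xC0) != 0x80)
      chars++;
  return chars;
}

/* Return nonzero if a text of NBYTES bytes can be addressed with
   uint32_t byte offsets such as byte_index and bytes_read.
   Hypothetical helper, not from Emacs.  */
static int
fits_in_uint32_offsets (uint64_t nbytes)
{
  return nbytes <= UINT32_MAX;
}
```

For the string "na\xc3\xafve" (the word "naive" spelled with a diaeresis on the i), strlen reports 6 bytes while utf8_char_count reports 5 characters, so a limit expressed in bytes is reached sooner than the same number expressed in characters.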