The "buffer gap" technique is very well known, I'm familiar with it since the early 90ies. I was thinking about it, but I think it won't work with UTF-8. If you have figured out how you would make it work with UTF-8, then please tell us.
Here is why I think it won't work with UTF-8. The problem is that when there are edits, you can't move characters from before the gap to after it (or the other way round) and change them: if some characters are changed, their byte length might change. And if you want to keep the string as valid UTF-8, you have to constantly fix up the contents of the gap. One could imagine using two separate String objects, one for the text before the gap and one for the text after it. For the text before the gap, that might actually work quite well (as long as Ruby doesn't shrink the memory allocated to a string when its contents are truncated), but for the text after the gap it won't, because every insertion or deletion at the end of the gap will make the string contents shift around.
In my implementation, the gap is kept filled with NUL bytes and is moved to the end of the buffer when a regular expression search is needed.
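Roughly, a minimal sketch of that kind of gap buffer (the names here, such as GapBuffer, @gap_start and to_search_string, are illustrative only and not the actual implementation; the real adjust_gap also takes different arguments):

# Minimal gap-buffer sketch; all names are illustrative.
class GapBuffer
  GAP_SIZE = 16

  def initialize(s = "")
    @contents = s.b + "\0" * GAP_SIZE   # UTF-8 bytes, tagged ASCII-8BIT
    @gap_start = s.bytesize
    @gap_end = @contents.bytesize
  end

  # Number of text bytes, excluding the gap.
  def bytesize
    @contents.bytesize - (@gap_end - @gap_start)
  end

  # Move the gap so that it starts at byte position pos, then refill it with NUL.
  def adjust_gap(pos)
    if pos < @gap_start
      len = @gap_start - pos
      @contents[@gap_end - len, len] = @contents[pos, len]
      @gap_start, @gap_end = pos, @gap_end - len
    elsif pos > @gap_start
      len = pos - @gap_start
      @contents[@gap_start, len] = @contents[@gap_end, len]
      @gap_start, @gap_end = pos, @gap_end + len
    end
    @contents[@gap_start, @gap_end - @gap_start] = "\0" * (@gap_end - @gap_start)
  end

  # Insert UTF-8 text at byte position pos.
  def insert(pos, str)
    bytes = str.b
    adjust_gap(pos)
    if bytes.bytesize > @gap_end - @gap_start   # grow the gap if it is too small
      @contents[@gap_start, 0] = "\0" * bytes.bytesize
      @gap_end += bytes.bytesize
    end
    @contents[@gap_start, bytes.bytesize] = bytes
    @gap_start += bytes.bytesize
  end

  # Push the gap to the end so the text is contiguous for a regexp search.
  def to_search_string
    adjust_gap(bytesize)
    @contents.byteslice(0, bytesize).force_encoding(Encoding::UTF_8)
  end
end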
More generally, what I'm afraid of is that with this, we start to expose String internals more and more. That can easily lead to problems.
Some people may copy a Ruby snippet using byteindex, then add 1 to that index because they think that's how to get to the next character. Others may start to use byteindex everywhere, even if it's absolutely not necessary. Others may demand byte- versions of more and more operations on strings. We have seen all of this in other contexts.
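For example (a sketch of the first pitfall, assuming byteindex returns byte offsets as proposed): adding 1 to a byte offset can land in the middle of a multi-byte character.

s = "café au lait"
b = s.byteindex("é")    # => 3: "é" starts at byte 3 ("caf" is 3 bytes)
s.byteslice(b, 2)       # => "é" (it occupies 2 bytes)
s.byteslice(b + 1, 1)   # => "\xA9", the trailing byte of "é", not the next character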
Doesn't this concern apply to byteslice?

Yes, it does. The fewer methods of this kind we have, the better.
Anyway, one more question: are you really having performance problems, or are you just worried about performance? Compared to today's hardware speed, human editing is extremely slow, and for most operations there should be no delay whatsoever.
Using character indices was slow, but my current implementation uses ASCII-8BIT strings whose contents are encoded in UTF-8, so there's no performance problem while editing Japanese text over 10 MB in size.
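In other words (a rough sketch of the representation, not the actual code): the buffer keeps UTF-8 bytes in a string tagged as ASCII-8BIT, so all indexing is by byte, and the encoding label is only switched when character semantics are needed.

text = "日本語テキスト".b                     # UTF-8 bytes, tagged ASCII-8BIT
text.bytesize                                # => 21
text[0, 3]                                   # => the 3 bytes of "日" (byte indexing)
text.force_encoding(Encoding::UTF_8).size    # => 7 characters (requires a scan)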
However, the implementation has the following terrible method:
def byteindex(forward, re, pos)
  @match_offsets = []
  method = forward ? :index : :rindex
  # Move the gap out of the way so @contents is one contiguous run of text.
  adjust_gap(0, point_max)
  if @binary
    offset = pos
  else
    # index/rindex take character offsets, so convert the byte position
    # by counting the characters in the leading bytes.
    offset = @contents[0...pos].force_encoding(Encoding::UTF_8).size
    @contents.force_encoding(Encoding::UTF_8)
  end
  begin
    i = @contents.send(method, re, offset)
    if i
      m = Regexp.last_match
      if m.nil?
        # A bug of rindex: no MatchData is set.
        @match_offsets.push([pos, pos])
        pos
      else
        b = m.pre_match.bytesize
        e = b + m.to_s.bytesize
        if e <= bytesize
          @match_offsets.push([b, e])
          match_beg = m.begin(0)
          match_str = m.to_s
          # Convert each capture group's character offsets to byte offsets.
          (1 .. m.size - 1).each do |j|
            cb, ce = m.offset(j)
            if cb.nil?
              @match_offsets.push([nil, nil])
            else
              bb = b + match_str[0, cb - match_beg].bytesize
              be = b + match_str[0, ce - match_beg].bytesize
              @match_offsets.push([bb, be])
            end
          end
          b
        else
          nil
        end
      end
    else
      nil
    end
  ensure
    # Go back to the byte-oriented representation.
    @contents.force_encoding(Encoding::ASCII_8BIT)
  end
end
As long as copy-on-write works, the performance of the code would not be so bad, but it
looks terrible.
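For comparison, here is a rough sketch of how the same method could shrink if byte-offset lookups were available directly. It assumes String#byteindex/#byterindex and MatchData#byteoffset along the lines of what is being proposed; adjust_gap, point_max, @contents, @binary and bytesize are the same as in the code above.

# Sketch only, assuming byte-offset variants of index/rindex/offset.
def byteindex(forward, re, pos)
  @match_offsets = []
  adjust_gap(0, point_max)
  @contents.force_encoding(Encoding::UTF_8) unless @binary
  begin
    b = forward ? @contents.byteindex(re, pos) : @contents.byterindex(re, pos)
    return nil if b.nil?
    m = Regexp.last_match
    return nil if m.byteoffset(0)[1] > bytesize   # match ran into the gap area
    (0...m.size).each do |j|
      @match_offsets.push(m.byteoffset(j))        # already a [byte_begin, byte_end] pair
    end
    b
  ensure
    @contents.force_encoding(Encoding::ASCII_8BIT)
  end
end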
A text editor is just an example, and my take is that ways to get byte offsets should be
provided because we already have byteslice. Otherwise, byteslice is not so useful.