Showing content from https://github.com/rust-lang/libs-team/issues/116 below:

Lossy UTF8 conversion of owned types (`Vec::<u8>::into_utf8_lossy`). · Issue #116 · rust-lang/libs-team · GitHub

Proposal

Problem statement

We should have a function that performs the same lossy conversion as String::from_utf8_lossy, but which converts a Vec<u8> to a String, instead of a &'a [u8] to a Cow<'a, str>.

Motivation, use-cases

Our current function String::from_utf8_lossy optimizes for the case where you need to borrow the input (a &[u8]) and can work with a borrowed output. It's very nice for this purpose, as it avoids the potentially costly copy in the common¹ case that the input is already valid UTF-8.
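To illustrate the borrowing behavior described above, here is a minimal sketch. The helper name `lossy_is_borrowed` is hypothetical and exists only for this example; it reports whether `String::from_utf8_lossy` was able to hand back a borrowed `Cow` without allocating.

```rust
use std::borrow::Cow;

/// Hypothetical helper: returns true when `String::from_utf8_lossy` could
/// borrow the input, meaning no new buffer was allocated for the result.
fn lossy_is_borrowed(bytes: &[u8]) -> bool {
    matches!(String::from_utf8_lossy(bytes), Cow::Borrowed(_))
}
```

For already-valid input such as `b"hello"` the result is borrowed, while input containing an invalid byte such as `b"hel\xFFlo"` forces an owned `String` in which the invalid byte is replaced by U+FFFD.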

Sadly, if you need an owned output (e.g. a String), there is no function in the stdlib that avoids copying for already-valid bytes, even if you're happy giving up your owned input Vec<u8>. In practice, you generally do this with an expression like String::from_utf8_lossy(&vec).to_string(), which has the downside of always performing an extra copy if the input was valid UTF-8. In other words, it pessimizes the already-valid-UTF-8 case. (It also has the downside of being slightly strange looking, although rewording it to avoid this is likely possible.)
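The extra copy can be observed by comparing heap pointers. The following is a sketch only; the function name `idiom_reuses_buffer` is made up for this example, and the pointer comparison is meaningful only for non-empty input (empty vectors and strings do not allocate).

```rust
/// Hypothetical helper: converts with the idiom from the text and reports
/// whether the resulting String shares its heap buffer with the input Vec.
fn idiom_reuses_buffer(vec: Vec<u8>) -> bool {
    let input_ptr = vec.as_ptr();
    // Borrow the Vec, convert lossily, then copy into a fresh String.
    let owned: String = String::from_utf8_lossy(&vec).to_string();
    // `vec` is still alive here, so a shared pointer would mean no copy was made.
    owned.as_ptr() == input_ptr
}
```

Even for valid UTF-8 input this returns false, because `to_string()` has to allocate a second buffer while the original `Vec<u8>` is still alive.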

It seems desirable to solve this by adding an analogous function that transforms an owned Vec<u8> (of potentially invalid UTF-8 bytes) into an owned String.

Solution sketches

I think the following API would be a good solution. A possible implementation is provided as well.

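// Sketch: an inherent impl on Vec<u8> can only live inside the standard library.
use std::borrow::Cow;

impl Vec<u8> {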
    pub fn into_utf8_lossy(self) -> String {
        if let Cow::Owned(string) = String::from_utf8_lossy(&self) {
            string
        } else {
            // SAFETY: `String::from_utf8_lossy`'s contract ensures that if
            // it returns a `Cow::Borrowed`, it is identical to the input.
            unsafe { String::from_utf8_unchecked(self) }
        }
    }
}
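Outside the standard library the same logic can be tried as a free function. The sketch below adapts the implementation above; the free-function form and its name `into_utf8_lossy` are assumptions made for this example, not part of the proposed API.

```rust
use std::borrow::Cow;

/// Free-function adaptation of the proposed method (hypothetical form).
fn into_utf8_lossy(bytes: Vec<u8>) -> String {
    if let Cow::Owned(string) = String::from_utf8_lossy(&bytes) {
        // Invalid sequences were found and replaced with U+FFFD, so a new
        // String had to be built anyway.
        string
    } else {
        // SAFETY: `String::from_utf8_lossy` only returns `Cow::Borrowed`
        // when the input is already valid UTF-8, so the bytes can be
        // reinterpreted as a String without copying.
        unsafe { String::from_utf8_unchecked(bytes) }
    }
}
```

With valid input, the returned String takes over the Vec's existing allocation; with invalid input, the result is the same as `String::from_utf8_lossy` followed by a copy into an owned String.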

I explored several other options in the past in the IRLO thread linked below.

Links and related work

An IRLO thread and writeup I made for this a couple of years ago: https://internals.rust-lang.org/t/too-many-words-on-a-from-utf8-lossy-variant-that-takes-a-vec-u8/13005. It contains a number of alternative API designs of... varying quality.
bstr (CC @BurntSushi) has a similar API for this, but uses the name bstr::ByteVec::into_string_lossy. I don't have strong opinions on the name.

What happens now?

This issue is part of the libs-api team API change proposal process. Once this issue is filed, the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.

¹ I'm speculating when I suggest that already valid input is the common case, but given that the usefulness of the result of this function is tied to the UTF-8 validity of the input (e.g. if it's mostly invalid UTF-8, then the output is likely to be less readable), it seems generally reasonable.
