### Motivation, use-cases

For individual bytes, the backend can plausibly be expected to optimize things when it knows that a `char` came from an ASCII-ranged value. However, for compound values, there's no realistic way that codegen backends can track enough information to optimize out UTF-8 validity checks. That leads to lots of "look, it's ASCII" comments on `unsafe` blocks because the safe equivalents have material performance impact.
We should offer a nice way for people to write such "the problem fundamentally only produces ASCII `String`s" code without needing `unsafe` and without needing spurious UTF-8 validation checks.
After all, to quote `std::ascii`:

> However, at times it makes more sense to only consider the ASCII character set for a specific operation.
I was reminded about this by this comment in rust-lang/rust#105076:
```rust
pub fn as_str(&self) -> &str {
    // SAFETY: self.data[self.alive] is all ASCII characters.
    unsafe { crate::str::from_utf8_unchecked(self.as_bytes()) }
}
```
But I've been thinking about this since this Reddit thread: https://www.reddit.com/r/rust/comments/yaft60/zerocost_iterator_abstractionsnot_so_zerocost/. "base85" encoding is an exemplar of problems whose output is fundamentally ASCII-only. But the code there ends with a `String::from_utf8(outdata).unwrap()` because the other options aren't great.
One might say "oh, just build a `String` as you go" instead, but that doesn't work as well as you'd hope. Pushing a byte onto a `Vec<u8>` generates substantially less complicated code than pushing one onto a `String` (https://rust.godbolt.org/z/xMYxj5WYr), since a `0..=255` USV might still take 2 bytes in UTF-8. That problem could be worked around with a `BString` instead, going outside the standard library, but that's not a fix for the whole thing, because then there's still a check needed later to get back to a `&str` or `String`.
There should be a `core` type for an individual ASCII character. Having proven to LLVM at the individual-character level that values are in range (which it optimizes well, and does in similar existing cases today), the library could then offer safe O(1) conversions that take advantage of that type-level information.
[Edit 2023-02-16] This conversation on zulip https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/.E2.9C.94.20Iterate.20ASCII-only.20.26str/near/328343781 made me think about this too -- having a type gives clear ways to get "just the ASCII characters from a string" using something like `.filter_map(AsciiChar::new)`.
[Edit 2023-04-27] Another conversation on zulip https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/core.3A.3Astr.3A.3Afrom_ascii/near/353452589 noted that on embedded targets the "is ASCII" check is much simpler than the "is UTF-8" check, so being able to use it where appropriate can save a bunch of binary size. cc @kupiakos
### Solution sketches

In `core::ascii`,
```rust
/// One of the 128 Unicode characters from U+0000 through U+007F,
/// often known as the [ASCII](https://www.unicode.org/glossary/index.html#ASCII) subset.
///
/// AKA the character codes from ANSI X3.4-1977, ISO 646-1973,
/// or [NIST FIPS 1-2](https://nvlpubs.nist.gov/nistpubs/Legacy/FIPS/fipspub1-2-1977.pdf).
///
/// # Layout
///
/// This type is guaranteed to have a size and alignment of 1 byte.
#[derive(Copy, Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
#[repr(transparent)]
struct Char(u8 is 0..=127);

impl Debug for Char { … }
impl Display for Char { … }

impl Char {
    const fn new(c: char) -> Option<Self> { … }
    const fn from_u8(x: u8) -> Option<Self> { … }
    const unsafe fn from_u8_unchecked(x: u8) -> Self { … }
}

impl From<Char> for char { … }
impl From<&[Char]> for &str { … }
```
In `alloc::string`:
```rust
impl From<Vec<ascii::Char>> for String { … }
```
^ this `From` being the main idea of the whole thing
Safe code can `Char::new(…).unwrap()`, since LLVM easily optimizes that for known values (https://rust.godbolt.org/z/haabhb6aq), or do it in `const`s, then use the non-reallocating infallible `From`s later when it needs `String`s or `&str`s.
I wouldn't put any of these in an initial PR, but as related things:

- `repr(u8)`. That would allow `as` casting it, for better or worse.
- Constants for `ACK` and `DEL` and such.
- More `AsRef<str>`s are possible, like on `ascii::Char` itself or arrays/vectors thereof.
- `AsRef<[u8]>`s too.
- `String::push_ascii`
- `String: Extend<ascii::Char>` or `String: FromIterator<ascii::Char>`
- A way to convert a `&str` (or `&[u8]`) back to `&[ascii::Char]`.
- A `[u8; N] -> [ascii::Char; N]` operation it can use in a `const`, so it can have something like ``const BYTE_TO_CHAR85: [ascii::Char; 85] = something(b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~").unwrap();``. Long-term that's called `array::map` and unwrapping each value, but without const closures that doesn't work yet -- for now it could open-code it, though.

And of course there's the endless bikeshed on what to call the type in the first place. Would it be worth making it something like `ascii::AsciiChar`, despite the stuttering name, to avoid `Char` vs `char` confusion?
This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.