Add substr_range
, elem_offset
, and subslice_range
methods #382
Description
Proposal
Add the following methods
impl str {
fn substr_range(&self, item: &str) -> Option<Range<usize>>;
}
impl<T> [T] {
fn subslice_range(&self, item: &[T]) -> Option<Range<usize>>;
fn elem_offset(&self, item: &T) -> Option<usize>;
}
NOTE: These methods are completely different in both behavior and functionality from str::find
and friends. They do not compare characters/items inside of the str
/slice
s. Instead they utilize pointer arithmetic to find where a subslice/item is located in the str
/slice
.
Problem statement
This change attempts to fix two distinct issue. The first of which is the inflexibility of slice::split
, str::split
, str::matches
, and related methods. The second is for parse errors.
Problem 1
str::split
and its related functions are really convenient, but I find myself having to manually implement them if I want more complex logic.
For instance, lets say that I want to split up a string by commas. Specifically, I want to do something like str::split_inclusive
, but I want to include the separators at the beginning instead of the end.
You cannot do this with str::split
let foo = "a,bc,d";
for sub in foo.split(",") {
// Now I have the split up string, but it doesn't have the comma at the beginning :-(
}
Instead, you have to manually do it yourself
let mut foo = "a,bc,d";
while let Some(comma_idx) = foo.get(1..).map(|f| 1+f.find(",").unwrap_or(f.len())) {
let sub = &foo[..comma_idx];
foo = &foo[comma_idx..];
// sub now has leading comma
}
Problem 2
Say I have a function for parsing and a helper function. The Range
denotes where in the string the error occurs. We can use that Range
to tell the user where the error is if one occurs.
pub fn parse<'a>(in: &'a str) -> Result<usize, Range<usize>> {
// omitted //
let a = parse_helper(&in[x..y])?;
let b = parse_helper(&in[y+1..])?;
// omitted //
}
fn parse_helper<'a>(in: &'a str) -> Result<usize, Range<usize>> { /* omitted */ }
The issue with this is that the range error from parse_helper
is relative to the substring passed to it. This means that we have to manually offset the Range
inside of parse
.
pub fn parse<'a>(in: &'a str) -> Result<usize, Range<usize>> {
// omitted //
let a = parse_helper(&in[x..y]).map_err(|e| e.start + x..e.end + x)?;
let b = parse_helper(&in[y+1..]).map_err(|e| e.start + y+1..e.end + y+1)?;
// omitted //
}
// ... //
We could also pass an offset to parse_helper
and use that to adjust the returned Range
, but that would just move the complexity to the parse_helper
function.
This gets even worse if instead of a Range
, we have an enum that contains Range
s. Say
enum ParseError {
Error1(Range<usize>),
Error2(Range<usize>, Range<usize>),
}
Motivating examples or use cases
Split and similar methods
subslice_offset
would allow using indices to extend methods like str::split
. We can implement the aforementioned inclusive str::split
like so:
let foo = "a,bc,d,";
for sub in foo.split(",") {
let idx = foo.substr_range(sub).unwrap();
let sub = &foo[idx.start.saturating_sub(1)..idx.end];
// Now `sub` is "a", ",bc", ",d", and ","
}
Error handling with string input data
We can also use this method to remove complexity from the code described in Problem 2.
pub fn parse<'a>(in: &'a str) -> Result<usize, &'a str> {
// omitted //
let a = parse_helper(&in[x..y])?;
let b = parse_helper(&in[y+1..])?;
// omitted //
}
fn parse_helper<'a>(in: &'a str) -> Result<usize, &'a str> { /* omitted */ }
Instead of returning Range
s, we can return &'a str
. Then, if a caller of parse
wants to find where the error occurred, they can do
let input = "9+10-21";
match parse(input) {
Ok(val) => { /* omitted */ },
Err(err) => {
display_error(input.substr_range(err).unwrap());
},
Solution sketch
impl str {
fn substr_range(&self, substr: &str) -> Option<Range<usize>> {
self.as_bytes().subslice_range(substr.as_bytes())
}
}
impl<T> [T] {
fn subslice_range(&self, subslice: &[T]) -> Option<Range<usize>> {
if core::mem::size_of::<T>() == 0 {
panic!();
}
let self_start = self.as_ptr() as usize;
let subslice_start = subslice.as_ptr() as usize;
let byte_start = subslice_start.wrapping_sub(self_start);
let start = byte_start / core::mem::size_of::<T>();
let end = start.wrapping_add(subslice.len());
if byte_start % core::mem::size_of::<T>() != 0 {
return None;
}
if start <= self.len() && end <= self.len() {
Some(start..end)
} else {
None
}
}
fn elem_offset(&self, element: &T) -> Option<usize> {
if core::mem::size_of::<T>() == 0 {
panic!();
}
let self_start = self.as_ptr() as usize;
let elem_start = element as *const T as usize;
let byte_offset = elem_start.wrapping_sub(self_start);
let offset = byte_offset / core::mem::size_of::<T>();
if byte_offset % core::mem::size_of::<T>() != 0 {
return None;
}
if offset < self.len() {
Some(offset)
} else {
None
}
}
}
The following code demonstrates the above behavior
let foo = "mississippi".to_owned();
let is = &foo[4..6];
assert_eq!(is, "is");
assert_eq!(foo.substr_offset(is), Some(4..6));
assert_eq!(foo.substr_offset("is"), None);
Alternatives
- subslice_offset crate
- str::find (significantly less efficient and has the aforementioned issue)
Links and related work
subslice_offset crate
Old, deprecated str::subslice_offset
. (deprecated here).
This was deprecated with str::find
being listed as its replacement, but as mentioned before, this has different functionality.
rust-lang/rfcs#2796 (an RFC similar to this but only for slices). The PR for this was abandoned because the author did not have time to make suggested changes.
https://github.com/wr7/rfcs/blob/substr_range/text/3648-substr-range.md - An RFC that I wrote for this
Activity