Skip to content

Add substr_range, elem_offset, and subslice_range methods #382

Closed
@wr7

Description

Proposal

Add the following methods

impl str {
    fn substr_range(&self, item: &str) -> Option<Range<usize>>;
}
impl<T> [T] {
    fn subslice_range(&self, item: &[T]) -> Option<Range<usize>>;
    fn elem_offset(&self, item: &T) -> Option<usize>;
}

NOTE: These methods are completely different in both behavior and functionality from str::find and friends. They do not compare characters/items inside of the str/slices. Instead they utilize pointer arithmetic to find where a subslice/item is located in the str/slice.

Problem statement

This change attempts to fix two distinct issue. The first of which is the inflexibility of slice::split, str::split, str::matches, and related methods. The second is for parse errors.

Problem 1

str::split and its related functions are really convenient, but I find myself having to manually implement them if I want more complex logic.

For instance, lets say that I want to split up a string by commas. Specifically, I want to do something like str::split_inclusive, but I want to include the separators at the beginning instead of the end.

You cannot do this with str::split

let foo = "a,bc,d";
for sub in foo.split(",") {
    // Now I have the split up string, but it doesn't have the comma at the beginning :-(
}

Instead, you have to manually do it yourself

    let mut foo = "a,bc,d";

    while let Some(comma_idx) = foo.get(1..).map(|f| 1+f.find(",").unwrap_or(f.len())) {
        let sub = &foo[..comma_idx];
        foo = &foo[comma_idx..];

        // sub now has leading comma
    }

Problem 2

Say I have a function for parsing and a helper function. The Range denotes where in the string the error occurs. We can use that Range to tell the user where the error is if one occurs.

pub fn parse<'a>(in: &'a str) -> Result<usize, Range<usize>> {
    // omitted //
    let a = parse_helper(&in[x..y])?;
    let b = parse_helper(&in[y+1..])?;
   // omitted //
}
fn parse_helper<'a>(in: &'a str) -> Result<usize, Range<usize>> { /* omitted */ }

The issue with this is that the range error from parse_helper is relative to the substring passed to it. This means that we have to manually offset the Range inside of parse.

pub fn parse<'a>(in: &'a str) -> Result<usize, Range<usize>> {
    // omitted //
    let a = parse_helper(&in[x..y]).map_err(|e| e.start + x..e.end + x)?;
    let b = parse_helper(&in[y+1..]).map_err(|e| e.start + y+1..e.end + y+1)?;
   // omitted //
}
// ... //

We could also pass an offset to parse_helper and use that to adjust the returned Range, but that would just move the complexity to the parse_helper function.

This gets even worse if instead of a Range, we have an enum that contains Ranges. Say

enum ParseError {
   Error1(Range<usize>),
   Error2(Range<usize>, Range<usize>),
}

Motivating examples or use cases

Split and similar methods

subslice_offset would allow using indices to extend methods like str::split. We can implement the aforementioned inclusive str::split like so:

let foo = "a,bc,d,";
for sub in foo.split(",") {
    let idx = foo.substr_range(sub).unwrap();
    let sub = &foo[idx.start.saturating_sub(1)..idx.end];
    // Now `sub` is "a", ",bc", ",d", and ","
}

Error handling with string input data

We can also use this method to remove complexity from the code described in Problem 2.

pub fn parse<'a>(in: &'a str) -> Result<usize, &'a str> {
    // omitted //
    let a = parse_helper(&in[x..y])?;
    let b = parse_helper(&in[y+1..])?;
   // omitted //
}
fn parse_helper<'a>(in: &'a str) -> Result<usize, &'a str> { /* omitted */ }

Instead of returning Ranges, we can return &'a str. Then, if a caller of parse wants to find where the error occurred, they can do

let input = "9+10-21";
match parse(input) {
    Ok(val) => { /* omitted */ },
    Err(err) => {
        display_error(input.substr_range(err).unwrap());
    },

Solution sketch

impl str {
    fn substr_range(&self, substr: &str) -> Option<Range<usize>> {
        self.as_bytes().subslice_range(substr.as_bytes())
    }
}

impl<T> [T] {
    fn subslice_range(&self, subslice: &[T]) -> Option<Range<usize>> {
        if core::mem::size_of::<T>() == 0 {
            panic!();
        }

        let self_start = self.as_ptr() as usize;
        let subslice_start = subslice.as_ptr() as usize;

        let byte_start = subslice_start.wrapping_sub(self_start);
        let start = byte_start / core::mem::size_of::<T>();
        let end = start.wrapping_add(subslice.len());

        if byte_start % core::mem::size_of::<T>() != 0 {
            return None;
        }

        if start <= self.len() && end <= self.len() {
            Some(start..end)
        } else {
            None
        }
    }

    fn elem_offset(&self, element: &T) -> Option<usize> {
        if core::mem::size_of::<T>() == 0 {
            panic!();
        }

        let self_start = self.as_ptr() as usize;
        let elem_start = element as *const T as usize;

        let byte_offset = elem_start.wrapping_sub(self_start);
        let offset = byte_offset / core::mem::size_of::<T>();

        if byte_offset % core::mem::size_of::<T>() != 0 {
            return None;
        }

        if offset < self.len() {
            Some(offset)
        } else {
            None
        }
    }
}

The following code demonstrates the above behavior

    let foo = "mississippi".to_owned();
    let is = &foo[4..6];

    assert_eq!(is, "is");
    assert_eq!(foo.substr_offset(is), Some(4..6));

    assert_eq!(foo.substr_offset("is"), None); 

Alternatives

Links and related work

subslice_offset crate

Old, deprecated str::subslice_offset. (deprecated here).
This was deprecated with str::find being listed as its replacement, but as mentioned before, this has different functionality.

Original subslice_offset PR

rust-lang/rfcs#2796 (an RFC similar to this but only for slices). The PR for this was abandoned because the author did not have time to make suggested changes.

https://github.com/wr7/rfcs/blob/substr_range/text/3648-substr-range.md - An RFC that I wrote for this

https://stackoverflow.com/questions/50781561/how-to-find-the-starting-offset-of-a-string-slice-of-another-string

Activity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

No one assigned

    Labels

    T-libs-apiapi-change-proposalA proposal to add or alter unstable APIs in the standard libraries

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions