Regex \B doesn't match empty string #124130

Open
Alcaro opened this issue Sep 16, 2024 · 9 comments
Labels: 3.14 (new features, bugs and security fixes), topic-regex, type-feature (A feature request or enhancement)

Comments

@Alcaro

Alcaro commented Sep 16, 2024

Bug report

Bug description:

>>> import re
>>> list(re.finditer(r'\b', 'e'))
[<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(1, 1), match=''>]
>>> list(re.finditer(r'\B', 'e'))
[]
>>> list(re.finditer(r'\b', '%'))
[]
>>> list(re.finditer(r'\B', '%'))
[<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(1, 1), match=''>]
>>> list(re.finditer(r'\b', ''))
[]
>>> list(re.finditer(r'\B', ''))
[]

Apparently the empty string neither is nor isn't a word boundary. Is that supposed to happen? \B matches the empty string in every other language I can think of.

Online reproducer: https://godbolt.org/z/8q6fehss7

CPython versions tested on:

3.11, 3.12

Operating systems tested on:

Linux


@Alcaro Alcaro added the type-bug An unexpected behavior, bug, or error label Sep 16, 2024
@brianschubert
Contributor

brianschubert commented Sep 16, 2024

The current behavior (\B not matching "") appears to be intentional:

cpython/Lib/test/test_re.py

Lines 896 to 898 in aba42c0

# However, an empty string contains no word boundaries, and also no
# non-boundaries.
self.assertIsNone(re.search(r"\B", ""))

Strictly speaking, re is working as designed here, even if that design is arguably "incorrect" with respect to other regex implementations.

Changing this behavior would technically be a feature request (not a bug), and would likely require a deprecation period. I'd recommend opening a thread on https://discuss.python.org/c/ideas/6 to see if there's community interest in changing this.

For reference, JavaScript treats the empty string as containing a non-word boundary:

>> /\b/.test("")
false
>> /\B/.test("")
true

@y5c4l3
Contributor

y5c4l3 commented Sep 16, 2024

Another Perl reference:

$ perl -E 'say "Boundary" if "" =~ /\b/; say "Non-boundary" if "" =~ /\B/;'
Non-boundary

It seems that re takes Perl as a baseline to some extent, as the comments in test_re indicate:

cpython/Lib/test/test_re.py

Lines 896 to 901 in 9017b95

# However, an empty string contains no word boundaries, and also no
# non-boundaries.
self.assertIsNone(re.search(r"\B", ""))
# This one is questionable and different from the perlre behaviour,
# but describes current behavior.
self.assertIsNone(re.search(r"\b", ""))

However the comments really confuse me, since the current behavior of \b is the same as Perl's, yet the comment still claims it is different from the perlre behaviour. Or have I missed a later change to the behavior of \b? If so, I think it would also be reasonable to align the behavior of \B with Perl's.

@terryjreedy terryjreedy added topic-regex type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Sep 16, 2024
@terryjreedy
Member

However the comments really confuse me

I think both comments apply to the \B case, as that is the one different from Perl. Splitting the comment that way is a bit confusing if one does not know Perl, which most do not now. @serhiy-storchaka can comment, but I don't think we can really change this now.

@serhiy-storchaka
Member

This test was added in 5a045b9 (bpo-10713/gh-54922). It was not an assertion of the intended behavior; it was added to ensure that the current behavior would not change. Strictly speaking, the current behavior contradicts the documentation, which says that \B is the opposite of \b.
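
The "opposite of \b" reading can be checked mechanically: at every position of a string, exactly one of \b and \B should match. The following is a minimal sketch of such a check; the helper names `boundary_positions` and `is_complementary` are made up for illustration and are not part of re or its test suite.

```python
import re


def boundary_positions(text):
    r"""Return two sets: positions where \b matches, positions where \B matches."""
    word = {m.start() for m in re.finditer(r"\b", text)}
    non_word = {m.start() for m in re.finditer(r"\B", text)}
    return word, non_word


def is_complementary(text):
    r"""True if every position 0..len(text) is matched by exactly one of \b, \B."""
    word, non_word = boundary_positions(text)
    every_position = set(range(len(text) + 1))
    return word.isdisjoint(non_word) and (word | non_word) == every_position
```

For non-empty inputs such as "e", "%" or "ab cd" the check holds. With the behavior reported in this issue (3.11 and 3.12), both sets are empty for "", so `is_complementary("")` would be False there; the result for "" depends on the interpreter version.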

Of course, it may be that the documentation is wrong. But the current behavior differs from that of many (if not all) other RE engines, the code is most likely a copying error that was never properly tested, we have already made several breaking changes related to zero-width matches in the past, and the change affects only very specific cases. Taking all of this into account, I think that we can and should change this behavior.

It is preferable to emit a FutureWarning first, but I do not know how difficult it is to do this without producing false positives. If it is too difficult or impossible, we will have no option other than to change the behavior without a warning.

@serhiy-storchaka serhiy-storchaka self-assigned this Sep 17, 2024
@serhiy-storchaka serhiy-storchaka added the 3.14 new features, bugs and security fixes label Sep 17, 2024
@gpshead
Member

gpshead commented Sep 17, 2024

We've been reluctant to make any changes to the re module behavior in the past as people's existing code depends on all of its behaviors per Hyrum's Law.

@vstinner
Member

The best that we can do is to document the bug, if there is a bug, and explain that it's kept for backward compatibility.
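
For code that needs the Perl/JavaScript result regardless of how \B treats empty input, one possible workaround is to spell "not a word boundary" as a negative lookahead around \b. This is only a sketch and not something proposed in this thread; the names `NOT_WORD_BOUNDARY` and `non_boundary_positions` are hypothetical.

```python
import re

# Hypothetical constant: "(?!\b)" is a zero-width negative lookahead that
# succeeds exactly where \b fails, so it does not rely on \B at all.
NOT_WORD_BOUNDARY = r"(?!\b)"


def non_boundary_positions(text):
    """Return every position in text that is not a word boundary."""
    return [m.start() for m in re.finditer(NOT_WORD_BOUNDARY, text)]
```

Because \b never matches in an empty string, the lookahead succeeds at position 0 of "", which is the answer the other engines quoted above give for \B.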

@ZeroIntensity
Member

The linked PR is now just a documentation note.

vstinner pushed a commit that referenced this issue Sep 23, 2024
Signed-off-by: y5c4l3 <y5c4l3@proton.me>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Sep 23, 2024
…ythonGH-124133)

(cherry picked from commit d3e79d7)

Co-authored-by: Y5 <124019959+y5c4l3@users.noreply.github.com>
Signed-off-by: y5c4l3 <y5c4l3@proton.me>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Sep 23, 2024
…ythonGH-124133)

(cherry picked from commit d3e79d7)

Co-authored-by: Y5 <124019959+y5c4l3@users.noreply.github.com>
Signed-off-by: y5c4l3 <y5c4l3@proton.me>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
@vstinner
Member

Documentation enhanced by d3e79d7.

vstinner pushed a commit that referenced this issue Sep 23, 2024
…H-124133) (#124329)

gh-124130: Notes on empty string corner case of category `\B` (GH-124133)
(cherry picked from commit d3e79d7)

Signed-off-by: y5c4l3 <y5c4l3@proton.me>
Co-authored-by: Y5 <124019959+y5c4l3@users.noreply.github.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Sep 23, 2024
@serhiy-storchaka
Member

I am afraid that the new note is confusing, because \B (as a part of a regular expression) does match an empty string (as a part of the input string).
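
To illustrate the distinction: inside a non-empty input, \B already produces zero-width (empty) matches; only an input that is empty as a whole is the special case. A small sketch, with `empty_matches` as a made-up helper name:

```python
import re


def empty_matches(pattern, text):
    """Return (span, matched_text) for every match of pattern in text."""
    return [(m.span(), m.group()) for m in re.finditer(pattern, text)]


# Position 1 lies between two word characters, so \B matches there,
# and the matched text is the empty string.
inside_word = empty_matches(r"\B", "ab")

# Every position of an all-punctuation input is a non-boundary.
punctuation_only = empty_matches(r"\B", "%%")

# The empty matches are real matches: substitution inserts text at them.
hyphenated = re.sub(r"\B", "-", "abc")
```

Each match in both lists has `''` as its text, so a note worded as "\B does not match the empty string" can be misread as contradicting these results.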

I also think that this is a legitimate case for changing the behavior, for reasons mentioned above. I planned to work on this later.

There is also a lack of tests. There are three places in the code responsible for this behavior, and only one of them is covered by tests.
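
A rough sketch of how wider coverage might be probed from Python follows. It assumes that the three places correspond to the Unicode, ASCII and locale-aware variants of the boundary check; the thread does not say which places are meant, so that mapping is a guess, and `VARIANTS` and `probe_empty_input` are hypothetical names.

```python
import re

# Assumption (not confirmed in the thread): one variant per matching mode.
# Each entry holds a compiled \b, a compiled \B, and an empty input of the
# matching type (str or bytes).
VARIANTS = {
    "unicode": (re.compile(r"\b"), re.compile(r"\B"), ""),
    "ascii": (re.compile(r"\b", re.ASCII), re.compile(r"\B", re.ASCII), ""),
    "locale": (re.compile(rb"\b", re.LOCALE), re.compile(rb"\B", re.LOCALE), b""),
}


def probe_empty_input():
    """Report, per variant, whether \\b and \\B match an empty input."""
    results = {}
    for name, (boundary, non_boundary, empty) in VARIANTS.items():
        results[name] = {
            "boundary": boundary.search(empty) is not None,
            "non_boundary": non_boundary.search(empty) is not None,
        }
    return results
```

A real test would assert a fixed expected value for each "non_boundary" entry; the sketch only reports what the running interpreter does, because that value is exactly what is under discussion here.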

Yhg1s pushed a commit that referenced this issue Sep 23, 2024
…H-124133) (#124328)

gh-124130: Notes on empty string corner case of category `\B` (GH-124133)
(cherry picked from commit d3e79d7)

Signed-off-by: y5c4l3 <y5c4l3@proton.me>
Co-authored-by: Y5 <124019959+y5c4l3@users.noreply.github.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Sep 24, 2024
…essions (pythonGH-124330)

(cherry picked from commit b82f076)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Sep 24, 2024
…essions (pythonGH-124330)

(cherry picked from commit b82f076)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit that referenced this issue Sep 24, 2024
…ressions (GH-124330) (GH-124414)

(cherry picked from commit b82f076)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit that referenced this issue Oct 7, 2024
…ressions (GH-124330) (GH-124413)

(cherry picked from commit b82f076)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
8 participants