Regex \B doesn't match empty string #124130
Comments
The current behavior ( Lines 896 to 898 in aba42c0
Strictly speaking, Changing this behavior would technically be a feature request (not a bug), and would likely require a deprecation period. I'd recommend opening a thread on https://discuss.python.org/c/ideas/6 to see if there's community interest in changing this.
For reference, JavaScript treats the empty string as containing a non-word boundary:
>> /\b/.test("")
false
>> /\B/.test("")
true
Another Perl reference:
$ perl -E 'say "Boundary" if "" =~ /\b/; say "Non-boundary" if "" =~ /\B/;'
Non-boundary
Seems that it is taking Perl as a baseline to some extent, as indicated in Lines 896 to 901 in 9017b95
However the comments really confuse me since the behavior for |
I think both comments apply to the |
This test was added in 5a045b9 (bpo-10713/gh-54922). It was not an assertion of the intended behavior; it was added to ensure that the current behavior would not change. Strictly speaking, the current behavior contradicts the documentation that says that Of course, it may be that the documentation is wrong. But taking into account that the current behavior differs from that of many (if not all) other RE engines, that the code is most likely a copying error (it was not properly tested), that we have already made several breaking changes related to zero-width matches in the past, and that it affects only very specific cases, I think that we can and should change this behavior. It would be preferable to emit a FutureWarning first, but I do not know how difficult it is to do this without producing false positives. If it is too difficult or impossible, we will have no option other than to change the behavior without warning.
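To make the argument above concrete, here is a minimal sketch of what `\B` would match if it were treated as the exact negation of `\b`, spelled out with lookarounds. The constant and function names are hypothetical, and the lookaround forms are an illustration under that assumption, not how the `re` engine implements these categories.

```python
import re

# Hypothetical constants: lookaround spellings of \b and \B,
# assuming \B is defined as the exact negation of \b.
WORD_BOUNDARY = r"(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))"
NON_WORD_BOUNDARY = r"(?:(?<=\w)(?=\w)|(?<!\w)(?!\w))"


def has_match(pattern: str, text: str) -> bool:
    """Return True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None


# The empty string has no word character on either side of position 0,
# so only the non-boundary spelling can match there.
print("boundary on '':", has_match(WORD_BOUNDARY, ""))
print("non-boundary on '':", has_match(NON_WORD_BOUNDARY, ""))
```

Under this reading the empty string contains a non-word boundary and no word boundary, which is the same answer as the JavaScript and Perl results quoted earlier in the thread.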
We've been reluctant to make any changes to the |
The best that we can do is to document the bug, if there is a bug, and explain that it's kept for backward compatibility. |
The linked PR is now just a documentation note. |
…ythonGH-124133) (cherry picked from commit d3e79d7) Co-authored-by: Y5 <124019959+y5c4l3@users.noreply.github.com> Signed-off-by: y5c4l3 <y5c4l3@proton.me> Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Documentation enhanced by d3e79d7. |
…H-124133) (#124329) gh-124130: Notes on empty string corner case of category `\B` (GH-124133) (cherry picked from commit d3e79d7) Signed-off-by: y5c4l3 <y5c4l3@proton.me> Co-authored-by: Y5 <124019959+y5c4l3@users.noreply.github.com> Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
I am afraid that the new note is confusing, because I also think that this is a legitimate case for changing the behavior, for the reasons mentioned above. I planned to work on this later. There is also a lack of tests. There are three places in the code responsible for this behavior, and only one of them is covered by tests.
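As a rough illustration of the coverage gap mentioned above, the sketch below runs `\B` under three matching modes. The comment does not say which three places in the code are meant; treating them as the Unicode, ASCII, and locale variants is an assumption made here, and the function name is hypothetical.

```python
import re


def non_boundary_variants(text: str) -> dict:
    r"""Report whether \B matches anywhere in text under three flag settings.

    The grouping into unicode/ascii/locale is an assumption for illustration.
    The text must be ASCII-only so it can be encoded for the bytes pattern.
    """
    return {
        "unicode": re.search(r"\B", text) is not None,
        "ascii": re.search(r"\B", text, re.ASCII) is not None,
        # re.LOCALE is only allowed with bytes patterns.
        "locale": re.search(rb"\B", text.encode("ascii"), re.LOCALE) is not None,
    }


for sample in ("", "a", "ab"):
    print(repr(sample), non_boundary_variants(sample))
```

The non-empty samples give a baseline that does not depend on the corner case: `"ab"` has a non-boundary between its two letters, while `"a"` has none. The empty-string row is the one this issue is about, and its result may differ between Python versions.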
…H-124133) (#124328) gh-124130: Notes on empty string corner case of category `\B` (GH-124133) (cherry picked from commit d3e79d7) Signed-off-by: y5c4l3 <y5c4l3@proton.me> Co-authored-by: Y5 <124019959+y5c4l3@users.noreply.github.com> Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
…essions (pythonGH-124330) (cherry picked from commit b82f076) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Bug report
Bug description:
Apparently the empty string neither is nor isn't a word boundary. Is that supposed to happen? \B matches the empty string in every other language I can think of.
Online reproducer: https://godbolt.org/z/8q6fehss7
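A minimal local sketch of the reported behavior follows. It is not necessarily identical to the linked reproducer, and the function name is hypothetical.

```python
import re
import sys


def boundary_report(text: str) -> dict:
    r"""Return whether \b and \B each find a match anywhere in text."""
    return {
        "word_boundary": re.search(r"\b", text) is not None,
        "non_word_boundary": re.search(r"\B", text) is not None,
    }


# Print the interpreter version next to the result, since the outcome for
# the empty string is the behavior under discussion.
print(sys.version_info[:2], boundary_report(""))
```

On the versions listed in this report, the description above says neither pattern matches the empty string, so both values would be False there.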
CPython versions tested on:
3.11, 3.12
Operating systems tested on:
Linux
Linked PRs
… `\B` #124133
… `\B` (GH-124133) #124328
… `\B` (GH-124133) #124329