-
-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-31714: Improved regular expression documentation. #3907
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a few comments.
There are also a couple of paragraphs with long lines that you could rewrap.
Doc/howto/regex.rst
Outdated
lowercasing doesn't take the current locale into account; it will if you also | ||
set the :const:`LOCALE` flag. | ||
letters, too. Full Unicode matching also works unless the :const:`re.ASCII` | ||
flag is also used to disable non-ASCII matches. ``[A-Z]`` will also match |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
disable non-ASCII matches
makes it sounds like it won't match any non-ASCII characters, but that is false:
>>> re.match('负鼠', '负鼠', re.ASCII)
<_sre.SRE_Match object; span=(0, 2), match='负鼠'>
I think it would be more correct to say that regex sets will only match characters in the ASCII range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a copy from existing re module documentation. :(
regex sets will only match characters in the ASCII range.
This doesn't sound good too. Regex sets can match characters outsides the ASCII range with the re.ASCII flag.
re.match('[耀-鿐]+', '负鼠', re.ASCII)
<re.Match object; span=(0, 2), match='负鼠'>
But case-insensitive matching works only in the ASCII range. 'é' doesn't match 'É' with the re.ASCII flag.
Doc/howto/regex.rst
Outdated
flag is also used to disable non-ASCII matches. ``[A-Z]`` will also match | ||
letters 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131, | ||
Latin small letter dotless i), 'ſ' (U+017f, Latin small letter long s) and | ||
'K' (U+212a, Kelvin sign) in Unicode mode. ``Spam`` will match ``Spam``, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is wrong: unless I'm mistaken [A-Z]
should be limited to upper case ASCII letters, even with re.UNICODE.
Perhaps you meant to use \w
?
It would also be better to specify what Unicode categories are matched, instead of just providing a few examples and letting the user figure it out from there.
I think this is already explained below, so a link or a mention to that section is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that this is about case-insensitive matching. 'S' matches both 's' and 'ſ'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some tests:
>>> unichars = ''.join(chr(cp) for cp in range(0x110000))
>>> ''.join(re.findall('[a-z]', bmp))
'abcdefghijklmnopqrstuvwxyz'
>>> ''.join(re.findall('[A-Z]', bmp))
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> ''.join(re.findall('[a-z]', bmp, re.I))
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzİıſK'
>>> ''.join(re.findall('[A-Z]', bmp, re.I))
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzİıſK'
Now I understand what the paragraph means: [a-z] and [A-Z] will only match ASCII lower and upper case letters respectively (so 26 chars each), however if re.I is used with either one, since those 4 letters (and only those 4) are valid capitalization of the 26 ASCII letters, they will be matched as well (bringing the total up to 26 + 26 + 4 == 56).
Do you think we should rephrase it to make it clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be very good to make it clearer. Now, since you understand what the paragraph means, could you please suggest a clear wording?
This example is not artificial. See bpo-31672. I think this caveat should be documented specially.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the patterns ``[a-z]`` or ``[A-Z]`` are used in combination with the re.I and re.U flag,
they will match the 52 ASCII letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin
capital letter I with dot above), 'ı' (U+0131, Latin small letter dotless i), 'ſ' (U+017F,
Latin small letter long s) and 'K' (U+212A, Kelvin sign).
Doc/howto/regex.rst
Outdated
that take account of language differences. For example, if you're | ||
processing encoded French text, you'd want to be able to write ``\w+`` to | ||
match words, but ``\w`` only matches the character class ``[A-Za-z]`` in | ||
bytes patterns; it won't match bytes corresponding to ``'é'`` or ``'ç'``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove the '...'
here and perhaps add b'\xe9'
and b'\xe7'
within parenthesis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not clear why b'\xe9' and b'\xe7' should be matched. But 'é' and 'ç' are French letters, and I have added "bytes corresponding to" for making this phrase Python 3 compatible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why I suggested that, is because 'é'
and 'ç'
are Unicode str in Python 3, whereas without quotes they are just letters. The addition of b'\xe9'
and b'\xe7'
might help clarify what is being matched, but it's not essential.
Doc/howto/regex.rst
Outdated
bytes patterns; it won't match bytes corresponding to ``'é'`` or ``'ç'``. | ||
If your system is configured properly and a French locale is selected, | ||
certain C functions will tell the program that the byte corresponding | ||
``'é'`` should also be considered a letter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
corresponding to é
I also don't like certain C functions will tell the program
too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you suggest remove quotes?
certain C functions will tell the program
already was here, it looks correct to me, and I don't know how improve it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to remove quotes, since we are talking about the character, not about a Unicode string.
Fair enough about the wording being there already.
Doc/howto/regex.rst
Outdated
is very unreliable, and it only handles one "culture" at a time anyway; | ||
and it only works with 8-bit locales; | ||
you should use Unicode matching instead, which is the default in Python 3 | ||
for Unicode (str) patterns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... is very unreliable, it only handles one "culture" at a time, and it only works with 8-bit locales. You should use ...
Since Unicode matching is the default, I wouldn't say "you should use", but just something like "Unicode matching is already enabled by default in Python 3, and it is able to handle different locales/languages."
Alternation, or the "or" operator. If A and B are regular expressions, | ||
``A|B`` will match any string that matches either ``A`` or ``B``. ``|`` has very | ||
Alternation, or the "or" operator. If *A* and *B* are regular expressions, | ||
``A|B`` will match any string that matches either *A* or *B*. ``|`` has very |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why *
instead of ``
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because A and B are not literal regexpes, but variables.
Doc/library/re.rst
Outdated
matches. | ||
|
||
match lowercase letters. Full Unicode matching (such as ``Ü`` matching | ||
``ü``) also works unless the :const:`re.ASCII` flag is also used to disable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is also used to
(you already have one before)
Doc/library/re.rst
Outdated
only letters 'A' to 'Z' and 'a' to 'z', but will also match letters 'İ' | ||
(U+0130, Latin capital letter I with dot above), 'ı' (U+0131, Latin small | ||
letter dotless i), 'ſ' (U+017f, Latin small letter long s) and 'K' (U+212a, | ||
Kelvin sign). If the :const:`ASCII` flag is used, only letters 'a' to 'z' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is duplicated from above, so the same comments applied (maybe it shouldn't be duplicated?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you suggest how to avoid duplication?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't notice the duplicated part was on two separate files, so it's probably ok to leave it.
Doc/library/re.rst
Outdated
you should use Unicode matching instead, which is the default in Python 3 | ||
for Unicode (str) patterns. This flag can be used only with bytes patterns. | ||
for Unicode (str) patterns. | ||
Correcsponds the inline flag ``(?L)``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correcsponds to
(extra c and missing to)
Also applies below.
Doc/library/re.rst
Outdated
|
||
>>> int_re = r'\d+' | ||
>>> print(re.sub('INT', int_re.replace('\\', r'\\'), r'INT(\.INT)?(e[+-]?INT)?')) | ||
\d+(\.\d+)?(e[+-]?\d+)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't find this example particularly clear. Why would someone want to use re.escape() on the replacement string? Wouldn't using int_re = r'\\d+'
(and a normal str.replace on INT) be easier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to provide simplified real-word example. In Mailman re.sub() is used for creating a regular expression. They passed the pattern containing a \d as a replacement string and got an error when this became invalid. Someone could use re.escape() on the replacement string, because the replacement string looks similar to simple pattern (it expands \n and \1). And this will work while the replacement string don't contain other metacharacters except a backslash.
I'll replace this example with the better one.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
Doc/howto/regex.rst
Outdated
@@ -1284,12 +1292,12 @@ doesn't work because of the greedy nature of ``.*``. :: | |||
>>> print(re.match('<.*>', s).group()) | |||
<html><head><title>Title</title> | |||
|
|||
The RE matches the ``'<'`` in ``<html>``, and the ``.*`` consumes the rest of | |||
The RE matches the ``<`` in ``'<html>'``, and the ``.*`` consumes the rest of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you do want '<'
as in the original here. See also line 1299 in this commit.
Doc/howto/regex.rst
Outdated
Alternation, or the "or" operator. If A and B are regular expressions, | ||
``A|B`` will match any string that matches either ``A`` or ``B``. ``|`` has very | ||
Alternation, or the "or" operator. If *A* and *B* are regular expressions, | ||
``A|B`` will match any string that matches either *A* or *B*. ``|`` has very | ||
low precedence in order to make it work reasonably when you're alternating | ||
multi-character strings. ``Crow|Servo`` will match either ``Crow`` or ``Servo``, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
everything after "will match either" have '[word]'
instead of [word]
?
Doc/howto/regex.rst
Outdated
flag is also used to disable non-ASCII matches. ``[A-Z]`` will also match | ||
letters 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131, | ||
Latin small letter dotless i), 'ſ' (U+017f, Latin small letter long s) and | ||
'K' (U+212a, Kelvin sign) in Unicode mode. ``Spam`` will match ``Spam``, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't all those after "will match" be '[word]'
instead of [word]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'm not sure that I understood your comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spam
, spam
, spAM
, or ſpam
-> 'Spam'
, 'spam'
, 'spAM'
, or 'ſpam'
? I might have gotten it wrong though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think he's referring to the 4 non-ASCII letters: they should be enclosed in `'..'`
Doc/howto/regex.rst
Outdated
@@ -229,24 +226,23 @@ Another repeating metacharacter is ``+``, which matches one or more times. Pay | |||
careful attention to the difference between ``*`` and ``+``; ``*`` matches | |||
*zero* or more times, so whatever's being repeated may not be present at all, | |||
while ``+`` requires at least *one* occurrence. To use a similar example, | |||
``ca+t`` will match ``cat`` (1 ``a``), ``caaat`` (3 ``a``'s), but won't match | |||
``ct``. | |||
``ca+t`` will match ``'cat'`` (1 ``'a'``), ``'caaat'`` (3 ``a``'s), but won't |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(3 'a'
s)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @ezio-melotti and @cryvate! I'll rewrap long lines after resolving all issues with wording.
Doc/howto/regex.rst
Outdated
lowercasing doesn't take the current locale into account; it will if you also | ||
set the :const:`LOCALE` flag. | ||
letters, too. Full Unicode matching also works unless the :const:`re.ASCII` | ||
flag is also used to disable non-ASCII matches. ``[A-Z]`` will also match |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a copy from existing re module documentation. :(
regex sets will only match characters in the ASCII range.
This doesn't sound good too. Regex sets can match characters outsides the ASCII range with the re.ASCII flag.
re.match('[耀-鿐]+', '负鼠', re.ASCII)
<re.Match object; span=(0, 2), match='负鼠'>
But case-insensitive matching works only in the ASCII range. 'é' doesn't match 'É' with the re.ASCII flag.
Doc/howto/regex.rst
Outdated
flag is also used to disable non-ASCII matches. ``[A-Z]`` will also match | ||
letters 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131, | ||
Latin small letter dotless i), 'ſ' (U+017f, Latin small letter long s) and | ||
'K' (U+212a, Kelvin sign) in Unicode mode. ``Spam`` will match ``Spam``, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that this is about case-insensitive matching. 'S' matches both 's' and 'ſ'.
Doc/howto/regex.rst
Outdated
flag is also used to disable non-ASCII matches. ``[A-Z]`` will also match | ||
letters 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131, | ||
Latin small letter dotless i), 'ſ' (U+017f, Latin small letter long s) and | ||
'K' (U+212a, Kelvin sign) in Unicode mode. ``Spam`` will match ``Spam``, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'm not sure that I understood your comment.
Doc/howto/regex.rst
Outdated
that take account of language differences. For example, if you're | ||
processing encoded French text, you'd want to be able to write ``\w+`` to | ||
match words, but ``\w`` only matches the character class ``[A-Za-z]`` in | ||
bytes patterns; it won't match bytes corresponding to ``'é'`` or ``'ç'``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not clear why b'\xe9' and b'\xe7' should be matched. But 'é' and 'ç' are French letters, and I have added "bytes corresponding to" for making this phrase Python 3 compatible.
Doc/howto/regex.rst
Outdated
bytes patterns; it won't match bytes corresponding to ``'é'`` or ``'ç'``. | ||
If your system is configured properly and a French locale is selected, | ||
certain C functions will tell the program that the byte corresponding | ||
``'é'`` should also be considered a letter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you suggest remove quotes?
certain C functions will tell the program
already was here, it looks correct to me, and I don't know how improve it.
Alternation, or the "or" operator. If A and B are regular expressions, | ||
``A|B`` will match any string that matches either ``A`` or ``B``. ``|`` has very | ||
Alternation, or the "or" operator. If *A* and *B* are regular expressions, | ||
``A|B`` will match any string that matches either *A* or *B*. ``|`` has very |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because A and B are not literal regexpes, but variables.
Doc/library/re.rst
Outdated
only letters 'A' to 'Z' and 'a' to 'z', but will also match letters 'İ' | ||
(U+0130, Latin capital letter I with dot above), 'ı' (U+0131, Latin small | ||
letter dotless i), 'ſ' (U+017f, Latin small letter long s) and 'K' (U+212a, | ||
Kelvin sign). If the :const:`ASCII` flag is used, only letters 'a' to 'z' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you suggest how to avoid duplication?
Doc/library/re.rst
Outdated
|
||
>>> int_re = r'\d+' | ||
>>> print(re.sub('INT', int_re.replace('\\', r'\\'), r'INT(\.INT)?(e[+-]?INT)?')) | ||
\d+(\.\d+)?(e[+-]?\d+)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to provide simplified real-word example. In Mailman re.sub() is used for creating a regular expression. They passed the pattern containing a \d as a replacement string and got an error when this became invalid. Someone could use re.escape() on the replacement string, because the replacement string looks similar to simple pattern (it expands \n and \1). And this will work while the replacement string don't contain other metacharacters except a backslash.
I'll replace this example with the better one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good to me. I have only one dumb comment; feel free to ignore it :)
Doc/howto/regex.rst
Outdated
@@ -526,7 +522,7 @@ of each one. | |||
+=================================+============================================+ | |||
| :const:`ASCII`, :const:`A` | Makes several escapes like ``\w``, ``\b``, | | |||
| | ``\s`` and ``\d`` match only on ASCII | | |||
| | characters with the respective property. | | |||
| | characters with the respective property | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is done for consistency? Any reason these shouldn't be complete sentences? I'd have rather added missing periods instead of removed them. Does this make the diff smaller? Is that worth it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No other reasons besides consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is single sentences in a table, I would normally not have a period, but that's just me.
I didn't expect the Spanish Inquisition! (Neither Italian) |
Nobody expects the Spanish Inquisition! @ezio-melotti: please review the changes made to this pull request. |
There is still (at least) a comment that is not addressed, but GitHub decided to hide it. You can see by clicking the second "show outdated" link from the top, or by checking the latest emails for this issue to make sure you don't miss any comment. |
Oh, sorry, @ezio-melotti, I missed this comment. Please check whether I understood you correctly. |
Thank you all for your reviews and suggestions! |
Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.6. |
Sorry, @serhiy-storchaka, I could not cleanly backport this to |
GH-3994 is a backport of this pull request to the 3.6 branch. |
https://bugs.python.org/issue31714