-
- Notifications
You must be signed in to change notification settings - Fork 33.4k
bpo-31690: Make "a", "L" and "u" inline flags in regular expressions local. #3885
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-31690: Make "a", "L" and "u" inline flags in regular expressions local. #3885
Conversation
warsaw left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Questions and suggestions for rewording some of the documentation.
Doc/library/re.rst Outdated
| and :const:`re.X` (verbose), for the part of the expression. | ||
| (The flags are described in :ref:`contents-of-module-re`.) | ||
| The letters ``'a'``, ``'L'`` and ``'u'`` can't be combined or follow | ||
| ``'-'``. Instead using the one of them temporary removes other flags. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last sentence doesn't parse quite right. How about "Using one of these flags temporarily removes other flags." ...?
Also, what flags are "temporarily removed" and what does it mean to temporarily remove a flag?
I guess I'm looking for more rationale as to why some flags can appear after the dash but others can't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last sentence doesn't parse quite right. How about "Using one of these flags temporarily removes other flags." ...?
LGTM.
Also, what flags are "temporarily removed" and what does it mean to temporarily remove a flag?
I meant the flags ASCII, LOCAL and UNICODE specified by letters 'a', 'L' and 'u'. Every of these flags affects the meaning of \w and case-insensitive matching. They are mutually exclusive. The UNICODE and LOCALE flags not affect the part inside (?a:...). And you can nest different flags: (?a: ascii matching (?u: unicode matching )).
I guess I'm looking for more rationale as to why some flags can appear after the dash but others can't.
Because the one, and only one of these flags should be set for unambiguity. If you remove the flag that is set, you should add other flag. Thus you would need to write (?a-u:...) for switching from Unicode matching to ASCII matching in a string pattern. This looks cumbersome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the details! I agree with the functionality. What about this for documentation:
"The letters 'a', 'L' and 'u' are mutually exclusive when used as inline flags, so they can't be combined or follow '-'. Instead, when one of them appears in an inline group, it overrides any of the other two letters in the enclosing group. This override is only in effect for the narrow inline group, and the original flags are restored outside of the group."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are yet more details. 'u' can be used only in Unicode patterns, and 'L' can be used only in byte pattern. Therefore for any of these letters there is only one other alternate letter in every type of patterns ('u'<->'a' in Unicode patterns and 'a'<->'L' in byte patterns).
In the current PR (?a:...) is used for disabling Unicode matching in Unicode patterns, and (?u:...) is used for restoring Unicode matching in Unicode patterns. (?L:...) is used for enabling locale-depending matching in byte patterns, and (?a:...) is used for disabling locale-depending matching in byte patterns (default).
We could use (?-a:...) for restoring Unicode matching in Unicode patterns and (?-L:...) for disabling locale-depending matching in byte patterns, but using different flags for switching to ASCII matching looks weird to me.
In future we either add support of the locale-aware matching in Unicode patterns (there is a patch in bpo-22407), or remove the locale-aware matching at all.
Does all this affect your thought about the wording?
| | ||
| .. versionchanged:: 3.7 | ||
| The letters ``'a'``, ``'L'`` and ``'u'`` can be used in a scope. | ||
| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think "in a scope" is quite the right choice of words. Maybe "...can be used as inline flags." ...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These flags already can be used as inline flags: (?a), (?L). But they are not scoped and affect the whole expression.
Inline flags are flags inlined in a pattern string instead of specified as a separate compile() argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe "group" instead of "scope"?
Doc/whatsnew/3.7.rst Outdated
| -- | ||
| | ||
| The flags :const:`re.ASCII`, :const:`re.LOCALE` and :const:`re.UNICODE` | ||
| can be set for the part of a regular expression. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about "... can be used as inline regular expression flags." ...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They can be used as inline regular expression flags before, but affect the whole expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about "...can be used within the scope of a group"?
| # update when constants are added or removed | ||
| | ||
| MAGIC = 20170530 | ||
| MAGIC = 20171005 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This occurred to me during the previous discussion regarding optimization efforts for re.compile(). I meant to bring it up on python-dev at the time but got distracted. I'll mention it here and maybe we need to discuss further. I think we might need to be as careful about regex magic numbers and bytecodes as we are for Python bytecodes.
The reason is that compiled regular expressions can be pickled.
Python 3.7.0a1+ (heads/debughook-dirty:28bd8e477d, Oct 4 2017, 23:21:36) [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.37)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import re >>> x = re.compile('foo') >>> from pickle import dumps >>> dumps(x) b'\x80\x03cre\n_compile\nq\x00X\x03\x00\x00\x00fooq\x01K \x86q\x02Rq\x03.' If I pickle a regex using Python 3.7.0 and the format changes for 3.7.1, then my compiled regexps will not be unpicklable, right? I haven't actually analyzed the regex pickle format, so I could be wrong about that, but if I'm right, then I think that's a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regex pickle format just contains a text pattern and integer flags. When unpickled re.compile() just is called. This format is compatible while the syntax of regular expressions is compatible.
| # Incompatibilities | ||
| self.assertRaises(ValueError, re.compile, br'\w', re.UNICODE) | ||
| self.assertRaises(ValueError, re.compile, br'(?u)\w') | ||
| self.assertRaises(re.error, re.compile, br'(?u)\w') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the change in exception raised? Isn't that an API break?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I consider this as minor implementation detail. re.error looks more correct to me here. ValueError is related to flags specified as a separate argument, re.error is related to inline flags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned that this may break some code. For example, in Mailman we sometimes have regular expression strings provided by a site admin or list owner. We have to check to see if they are valid by trying to compile them and catching any exceptions. So it seems like if this changes, those checks would fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re.error is an exception raised when an error in a regular expression is found. If you try to compile arbitrary string regular expression you should catch re.error. You should also catch ValueError (only raised when specify incompatible flags as a separate argument) and OverflowError (only raised if specify too large integers in {m,n}). And I think this is all. Now new exception added. If you have different handlers for re.error and ValueError, this change can be visible to you.
| @@ -0,0 +1,2 @@ | |||
| Allow to set the flags re.ASCII, re.LOCALE and re.UNICODE for the part of a | |||
| regular expression. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about: "Allow the flags re.ASCII, re.LOCALE, and re.UNICODE to be used as inline flags for regular expressions."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see a response. What do you think about this suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I answered similar comment in other place. "inline flags" are not correct words here. All these flags already can be used as inline flags ("(?a)", "("L)", "(?u)"). But these inline flags affect the entire regular expression. What is new in this PR it is allowing them to be set only for the part of the RE: "(?a:...)" etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can call these "group flags"? Thus:
Allow the flags re.ASCII, re.LOCALE, and re.UNICODE to be used as group flags for regular expressions.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have never encountered such an expression, but on the other hand, there seems to be no established terminology. I'm not very like "group flags" because it contains "group", but this group is non-capturing and it is not counted in the groups returned by "match.group()" and other API. But if it looks unambiguous to you, I'll use your wording.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original entry felt awkwardly worded to me. "Group flags" seems like a natural description. Ultimately, this is just a blurb entry so it's not that critical.
serhiy-storchaka left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First than tweak the wording more, I want to apply other documentation changes (#3907).
Doc/library/re.rst Outdated
| and :const:`re.X` (verbose), for the part of the expression. | ||
| (The flags are described in :ref:`contents-of-module-re`.) | ||
| The letters ``'a'``, ``'L'`` and ``'u'`` can't be combined or follow | ||
| ``'-'``. Instead using the one of them temporary removes other flags. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are yet more details. 'u' can be used only in Unicode patterns, and 'L' can be used only in byte pattern. Therefore for any of these letters there is only one other alternate letter in every type of patterns ('u'<->'a' in Unicode patterns and 'a'<->'L' in byte patterns).
In the current PR (?a:...) is used for disabling Unicode matching in Unicode patterns, and (?u:...) is used for restoring Unicode matching in Unicode patterns. (?L:...) is used for enabling locale-depending matching in byte patterns, and (?a:...) is used for disabling locale-depending matching in byte patterns (default).
We could use (?-a:...) for restoring Unicode matching in Unicode patterns and (?-L:...) for disabling locale-depending matching in byte patterns, but using different flags for switching to ASCII matching looks weird to me.
In future we either add support of the locale-aware matching in Unicode patterns (there is a patch in bpo-22407), or remove the locale-aware matching at all.
Does all this affect your thought about the wording?
| # Incompatibilities | ||
| self.assertRaises(ValueError, re.compile, br'\w', re.UNICODE) | ||
| self.assertRaises(ValueError, re.compile, br'(?u)\w') | ||
| self.assertRaises(re.error, re.compile, br'(?u)\w') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re.error is an exception raised when an error in a regular expression is found. If you try to compile arbitrary string regular expression you should catch re.error. You should also catch ValueError (only raised when specify incompatible flags as a separate argument) and OverflowError (only raised if specify too large integers in {m,n}). And I think this is all. Now new exception added. If you have different handlers for re.error and ValueError, this change can be visible to you.
Yes, but I'm not sure how ;) Or maybe, I'm not sure how to turn your excellent explanation into something more concise. Let's see how your other documentation changes work out first. |
| Could you please take a look on the updated PR @warsaw? |
warsaw left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a couple of questions, and noticed one spelling error. Other than that, it looks great, so I'll go ahead and approve it, conditional on at least fixing the spelling mistake.
Doc/library/re.rst Outdated
| (default). In byte pattern ``(?L:...)`` switches to locale depending | ||
| matching, and ``(?a:...)`` switches to ASCII-only matching (default). | ||
| This override is only in effect for the narrow inline group, and the | ||
| original matchin mode is restored outside of the group. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/matchin/matching/
| :const:`re.L` (locale dependent), :const:`re.M` (multi-line), | ||
| :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching), | ||
| and :const:`re.X` (verbose), for the part of the expression. | ||
| (The flags are described in :ref:`contents-of-module-re`.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use a bullet list here? I'll leave that to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will take too much space. Maybe later I'll add a table somewhere.
| @@ -0,0 +1,2 @@ | |||
| Allow to set the flags re.ASCII, re.LOCALE and re.UNICODE for the part of a | |||
| regular expression. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see a response. What do you think about this suggestion?
warsaw left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking the time to work with me on this change. It's an important and useful improvement! I'm going to approve this PR, and leave it up to you whether to reword the NEWS file blurb or not, based on my last comment.
| Thank you @warsaw for your review and help with the documentation. |
https://bugs.python.org/issue31690