bpo-31690: Make "a", "L" and "u" inline flags in regular expressions local. #3885

serhiy-storchaka · 2017-10-04T16:38:22Z

https://bugs.python.org/issue31690

warsaw

Questions and suggestions for rewording some of the documentation.

warsaw · 2017-10-05T13:47:35Z

Doc/library/re.rst

+ and :const:`re.X` (verbose), for the part of the expression.
+ (The flags are described in :ref:`contents-of-module-re`.)
+ The letters ``'a'``, ``'L'`` and ``'u'`` can't be combined or follow
+ ``'-'``. Instead using the one of them temporary removes other flags.


The last sentence doesn't parse quite right. How about "Using one of these flags temporarily removes other flags." ...?

Also, what flags are "temporarily removed" and what does it mean to temporarily remove a flag?

I guess I'm looking for more rationale as to why some flags can appear after the dash but others can't.

The last sentence doesn't parse quite right. How about "Using one of these flags temporarily removes other flags." ...?

LGTM.

Also, what flags are "temporarily removed" and what does it mean to temporarily remove a flag?

I meant the flags ASCII, LOCAL and UNICODE specified by letters 'a', 'L' and 'u'. Every of these flags affects the meaning of \w and case-insensitive matching. They are mutually exclusive. The UNICODE and LOCALE flags not affect the part inside (?a:...). And you can nest different flags: (?a: ascii matching (?u: unicode matching )).

I guess I'm looking for more rationale as to why some flags can appear after the dash but others can't.

Because the one, and only one of these flags should be set for unambiguity. If you remove the flag that is set, you should add other flag. Thus you would need to write (?a-u:...) for switching from Unicode matching to ASCII matching in a string pattern. This looks cumbersome.

Thanks for the details! I agree with the functionality. What about this for documentation:

"The letters 'a', 'L' and 'u' are mutually exclusive when used as inline flags, so they can't be combined or follow '-'. Instead, when one of them appears in an inline group, it overrides any of the other two letters in the enclosing group. This override is only in effect for the narrow inline group, and the original flags are restored outside of the group."

There are yet more details. 'u' can be used only in Unicode patterns, and 'L' can be used only in byte pattern. Therefore for any of these letters there is only one other alternate letter in every type of patterns ('u'<->'a' in Unicode patterns and 'a'<->'L' in byte patterns).

In the current PR (?a:...) is used for disabling Unicode matching in Unicode patterns, and (?u:...) is used for restoring Unicode matching in Unicode patterns. (?L:...) is used for enabling locale-depending matching in byte patterns, and (?a:...) is used for disabling locale-depending matching in byte patterns (default).

We could use (?-a:...) for restoring Unicode matching in Unicode patterns and (?-L:...) for disabling locale-depending matching in byte patterns, but using different flags for switching to ASCII matching looks weird to me.

In future we either add support of the locale-aware matching in Unicode patterns (there is a patch in bpo-22407), or remove the locale-aware matching at all.

Does all this affect your thought about the wording?

warsaw · 2017-10-05T13:50:45Z

Doc/library/re.rst


+ .. versionchanged:: 3.7
+ The letters ``'a'``, ``'L'`` and ``'u'`` can be used in a scope.
+


I don't think "in a scope" is quite the right choice of words. Maybe "...can be used as inline flags." ...?

These flags already can be used as inline flags: (?a), (?L). But they are not scoped and affect the whole expression.

Inline flags are flags inlined in a pattern string instead of specified as a separate compile() argument.

Maybe "group" instead of "scope"?

warsaw · 2017-10-05T13:51:25Z

Doc/whatsnew/3.7.rst

+--
+
+The flags :const:`re.ASCII`, :const:`re.LOCALE` and :const:`re.UNICODE`
+can be set for the part of a regular expression.


How about "... can be used as inline regular expression flags." ...?

They can be used as inline regular expression flags before, but affect the whole expression.

How about "...can be used within the scope of a group"?

warsaw · 2017-10-05T13:57:16Z

Lib/sre_constants.py

 # update when constants are added or removed

-MAGIC = 20170530
+MAGIC = 20171005


This occurred to me during the previous discussion regarding optimization efforts for re.compile(). I meant to bring it up on python-dev at the time but got distracted. I'll mention it here and maybe we need to discuss further. I think we might need to be as careful about regex magic numbers and bytecodes as we are for Python bytecodes.

The reason is that compiled regular expressions can be pickled.

Python 3.7.0a1+ (heads/debughook-dirty:28bd8e477d, Oct 4 2017, 23:21:36) [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.37)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import re >>> x = re.compile('foo') >>> from pickle import dumps >>> dumps(x) b'\x80\x03cre\n_compile\nq\x00X\x03\x00\x00\x00fooq\x01K \x86q\x02Rq\x03.'

If I pickle a regex using Python 3.7.0 and the format changes for 3.7.1, then my compiled regexps will not be unpicklable, right? I haven't actually analyzed the regex pickle format, so I could be wrong about that, but if I'm right, then I think that's a problem.

The regex pickle format just contains a text pattern and integer flags. When unpickled re.compile() just is called. This format is compatible while the syntax of regular expressions is compatible.

warsaw · 2017-10-05T13:58:08Z

Lib/test/test_re.py

 # Incompatibilities
 self.assertRaises(ValueError, re.compile, br'\w', re.UNICODE)
- self.assertRaises(ValueError, re.compile, br'(?u)\w')
+ self.assertRaises(re.error, re.compile, br'(?u)\w')


Why the change in exception raised? Isn't that an API break?

I consider this as minor implementation detail. re.error looks more correct to me here. ValueError is related to flags specified as a separate argument, re.error is related to inline flags.

I'm concerned that this may break some code. For example, in Mailman we sometimes have regular expression strings provided by a site admin or list owner. We have to check to see if they are valid by trying to compile them and catching any exceptions. So it seems like if this changes, those checks would fail.

re.error is an exception raised when an error in a regular expression is found. If you try to compile arbitrary string regular expression you should catch re.error. You should also catch ValueError (only raised when specify incompatible flags as a separate argument) and OverflowError (only raised if specify too large integers in {m,n}). And I think this is all. Now new exception added. If you have different handlers for re.error and ValueError, this change can be visible to you.

warsaw · 2017-10-05T13:59:34Z

Misc/NEWS.d/next/Library/2017-10-05-15-14-46.bpo-31690.f0XteV.rst

@@ -0,0 +1,2 @@
+Allow to set the flags re.ASCII, re.LOCALE and re.UNICODE for the part of a
+regular expression.


How about: "Allow the flags re.ASCII, re.LOCALE, and re.UNICODE to be used as inline flags for regular expressions."

I didn't see a response. What do you think about this suggestion?

Sorry, I answered similar comment in other place. "inline flags" are not correct words here. All these flags already can be used as inline flags ("(?a)", "("L)", "(?u)"). But these inline flags affect the entire regular expression. What is new in this PR it is allowing them to be set only for the part of the RE: "(?a:...)" etc.

Maybe we can call these "group flags"? Thus:

Allow the flags re.ASCII, re.LOCALE, and re.UNICODE to be used as group flags for regular expressions.

?

I have never encountered such an expression, but on the other hand, there seems to be no established terminology. I'm not very like "group flags" because it contains "group", but this group is non-capturing and it is not counted in the groups returned by "match.group()" and other API. But if it looks unambiguous to you, I'll use your wording.

The original entry felt awkwardly worded to me. "Group flags" seems like a natural description. Ultimately, this is just a blurb entry so it's not that critical.

serhiy-storchaka

First than tweak the wording more, I want to apply other documentation changes (#3907).

serhiy-storchaka · 2017-10-06T14:51:28Z

Doc/library/re.rst

+ and :const:`re.X` (verbose), for the part of the expression.
+ (The flags are described in :ref:`contents-of-module-re`.)
+ The letters ``'a'``, ``'L'`` and ``'u'`` can't be combined or follow
+ ``'-'``. Instead using the one of them temporary removes other flags.


There are yet more details. 'u' can be used only in Unicode patterns, and 'L' can be used only in byte pattern. Therefore for any of these letters there is only one other alternate letter in every type of patterns ('u'<->'a' in Unicode patterns and 'a'<->'L' in byte patterns).

In the current PR (?a:...) is used for disabling Unicode matching in Unicode patterns, and (?u:...) is used for restoring Unicode matching in Unicode patterns. (?L:...) is used for enabling locale-depending matching in byte patterns, and (?a:...) is used for disabling locale-depending matching in byte patterns (default).

We could use (?-a:...) for restoring Unicode matching in Unicode patterns and (?-L:...) for disabling locale-depending matching in byte patterns, but using different flags for switching to ASCII matching looks weird to me.

In future we either add support of the locale-aware matching in Unicode patterns (there is a patch in bpo-22407), or remove the locale-aware matching at all.

Does all this affect your thought about the wording?

serhiy-storchaka · 2017-10-06T16:27:43Z

Lib/test/test_re.py

 # Incompatibilities
 self.assertRaises(ValueError, re.compile, br'\w', re.UNICODE)
- self.assertRaises(ValueError, re.compile, br'(?u)\w')
+ self.assertRaises(re.error, re.compile, br'(?u)\w')


re.error is an exception raised when an error in a regular expression is found. If you try to compile arbitrary string regular expression you should catch re.error. You should also catch ValueError (only raised when specify incompatible flags as a separate argument) and OverflowError (only raised if specify too large integers in {m,n}). And I think this is all. Now new exception added. If you have different handlers for re.error and ValueError, this change can be visible to you.

warsaw · 2017-10-06T18:12:07Z

Does all this affect your thought about the wording?

Yes, but I'm not sure how ;) Or maybe, I'm not sure how to turn your excellent explanation into something more concise. Let's see how your other documentation changes work out first.

…dated suggestions.

serhiy-storchaka · 2017-10-22T14:32:12Z

Could you please take a look on the updated PR @warsaw?

warsaw

I have a couple of questions, and noticed one spelling error. Other than that, it looks great, so I'll go ahead and approve it, conditional on at least fixing the spelling mistake.

warsaw · 2017-10-22T21:56:59Z

Doc/library/re.rst

+ (default). In byte pattern ``(?L:...)`` switches to locale depending
+ matching, and ``(?a:...)`` switches to ASCII-only matching (default).
+ This override is only in effect for the narrow inline group, and the
+ original matchin mode is restored outside of the group.


s/matchin/matching/

warsaw · 2017-10-22T21:57:19Z

Doc/library/re.rst

+ :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
+ :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
+ and :const:`re.X` (verbose), for the part of the expression.
+ (The flags are described in :ref:`contents-of-module-re`.)


Maybe use a bullet list here? I'll leave that to you.

This will take too much space. Maybe later I'll add a table somewhere.

warsaw · 2017-10-22T21:59:24Z

Misc/NEWS.d/next/Library/2017-10-05-15-14-46.bpo-31690.f0XteV.rst

@@ -0,0 +1,2 @@
+Allow to set the flags re.ASCII, re.LOCALE and re.UNICODE for the part of a
+regular expression.


I didn't see a response. What do you think about this suggestion?

warsaw

Thanks for taking the time to work with me on this change. It's an important and useful improvement! I'm going to approve this PR, and leave it up to you whether to reword the NEWS file blurb or not, based on my last comment.

serhiy-storchaka · 2017-10-24T20:33:22Z

Thank you @warsaw for your review and help with the documentation.

serhiy-storchaka added 2 commits October 4, 2017 19:29

Add specialized opcodes for ignore-case matching.

8d85e3d

Make flags 'a', 'u' and 'L' scoped.

68275d3

serhiy-storchaka added the type-feature A feature request or enhancement label Oct 4, 2017

the-knights-who-say-ni added the CLA signed label Oct 4, 2017

bedevere-bot added the awaiting merge label Oct 4, 2017

serhiy-storchaka added 2 commits October 5, 2017 15:14

Add tests and the documentation.

c13cb5d

Update the magic number.

6622170

warsaw reviewed Oct 5, 2017

View reviewed changes

Apply Barry's suggestions.

71449b9

serhiy-storchaka commented Oct 6, 2017

View reviewed changes

serhiy-storchaka added 3 commits October 14, 2017 11:43

Merge branch 'master' into re-local-aLu

6658108

Add concrete examples for changing local matching mode and remove out…

e39547d

…dated suggestions.

Wrap long lines.

a367480

warsaw approved these changes Oct 22, 2017

View reviewed changes

Fix a typo.

0b2b351

warsaw approved these changes Oct 24, 2017

View reviewed changes

Update a news entry.

9748964

serhiy-storchaka merged commit 3557b05 into python:master Oct 24, 2017

bedevere-bot removed the awaiting merge label Oct 24, 2017

serhiy-storchaka deleted the re-local-aLu branch October 24, 2017 20:31


		.. versionchanged:: 3.7
		The letters ``'a'``, ``'L'`` and ``'u'`` can be used in a scope.

		@@ -0,0 +1,2 @@
		Allow to set the flags re.ASCII, re.LOCALE and re.UNICODE for the part of a
		regular expression.

Uh oh!

bpo-31690: Make "a", "L" and "u" inline flags in regular expressions local. #3885

bpo-31690: Make "a", "L" and "u" inline flags in regular expressions local. #3885

Uh oh!

Conversation

serhiy-storchaka commented Oct 4, 2017 • edited by bedevere-bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

warsaw left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

serhiy-storchaka left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

warsaw commented Oct 6, 2017

serhiy-storchaka commented Oct 22, 2017

warsaw left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

warsaw left a comment

Choose a reason for hiding this comment

serhiy-storchaka commented Oct 24, 2017

Labels

4 participants

serhiy-storchaka commented Oct 4, 2017 •

edited by bedevere-bot

Loading