Bug #4386
Closed
encoding: directive does not affect regex expressions
Description
=begin
$ cat foo.rb
#!/usr/local/bin/ruby
# encoding: UTF-8
puts /foo.*$/.encoding
$ ruby foo.rb
US-ASCII
$
=end
Updated by nobu (Nobuyoshi Nakada) over 14 years ago
=begin
US-ASCII only regexps are set to US-ASCII encoding, same as US-ASCII only strings.
It's intentional.
=end
Updated by luislavena (Luis Lavena) over 14 years ago
=begin
Nobu,
I don't see what you're showing:
C:\Users\Luis\Desktop>ruby -v
ruby 1.9.2p136 (2010-12-25) [i386-mingw32]
C:\Users\Luis\Desktop>ruby
# encoding: UTF-8
puts /.*/.encoding
^Z
US-ASCII
C:\Users\Luis\Desktop>type foo.rb
# encoding: UTF-8
puts /.*/.encoding
C:\Users\Luis\Desktop>ruby foo.rb
US-ASCII
We are having problems with this in Cucumber on Windows: a regexp, even in a file with a UTF-8 magic comment, is created as US-ASCII and does not match the output of backticks.
Sorry, but I don't understand why Regexps don't follow the magic comment as you showed.
=end
Updated by dlh (Daniel Harple) over 14 years ago
=begin
> US-ASCII only regexps are set to US-ASCII encoding, same as US-ASCII only strings. It's intentional.
In this case Regexp is not consistent with String's behavior:
$ cat test.rb
# encoding: utf-8
r = /a/
s = "a"
p [r, r.encoding]
p [s, s.encoding]
$ ruby -v test.rb
ruby 1.9.2p136 (2010-12-25 revision 30365) [i386-darwin9.8.0]
[/a/, #<Encoding:US-ASCII>]
["a", #<Encoding:UTF-8>]
=end
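The comparison above can be extended with `Regexp#fixed_encoding?`, which reports whether a regexp is pinned to its encoding. The following is only a sketch, assuming Ruby 1.9 or later; the method name `literal_encoding_report` is made up for illustration and is not from the ticket.

```ruby
# encoding: UTF-8
# Sketch only: report how an ASCII-only regexp literal and an ASCII-only
# string literal are tagged in the same source file.
# `literal_encoding_report` is a hypothetical helper, not part of the ticket.
def literal_encoding_report
  r = /a/
  s = "a"
  {
    :regexp_encoding       => r.encoding,                # US-ASCII, as reported above
    :regexp_fixed          => r.fixed_encoding?,         # false: not pinned to US-ASCII
    :string_follows_source => s.encoding == __ENCODING__, # string takes the script encoding
    :match_index           => (r =~ s)                   # the two still match each other
  }
end

literal_encoding_report.each do |key, value|
  puts "#{key}: #{value.inspect}"
end
```

If `regexp_fixed` comes back `false`, the US-ASCII tag is closer to "no particular encoding required" than to a hard constraint, which may explain why the mismatch with String rarely shows up in practice.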
Updated by drbrain (Eric Hodel) over 14 years ago
=begin
This regexp has US-ASCII-only characters:
puts /.*/.encoding
This regexp has UTF-8 characters, so it is in UTF-8 encoding:
$ ruby -ve '# coding: UTF-8' -e 'p /π/.encoding'
ruby 1.9.2p136 (2010-12-25 revision 30365) [x86_64-darwin10.6.0]
-e:2: warning: ambiguous first argument; put parentheses or even spaces
#<Encoding:UTF-8>
Of course, you can't mix:
$ ~/.multiruby/install/1.9.2-p136/bin/ruby -ve '# coding: US-ASCII' -e 'p /π/.encoding'
ruby 1.9.2p136 (2010-12-25 revision 30365) [x86_64-darwin10.6.0]
-e:2: warning: ambiguous first argument; put parentheses or even spaces
-e:2: invalid multibyte char (US-ASCII)
-e:2: invalid multibyte char (US-ASCII)
But you can force:
$ ~/.multiruby/install/1.9.2-p136/bin/ruby -ve '# coding: UTF-8' -e 'p /.*/u.encoding'
ruby 1.9.2p136 (2010-12-25 revision 30365) [x86_64-darwin10.6.0]
-e:2: warning: ambiguous first argument; put parentheses or even spaces
#<Encoding:UTF-8>
See also ri Regexp in the Encoding section:
A regexp can be matched against a string when they either share an encoding,
or the regexp's encoding is US-ASCII and the string's encoding is
ASCII-compatible.
A US-ASCII regexp matches a UTF-8 string correctly:
$ ~/.multiruby/install/1.9.2-p136/bin/ruby -ve '# coding: UTF-8' -e 'r = /my ./; p r.encoding, r =~ "my π", $&'
ruby 1.9.2p136 (2010-12-25 revision 30365) [x86_64-darwin10.6.0]
#<Encoding:US-ASCII>
0
"my π"
'.' matches the multibyte UTF-8 character π even though the regexp is tagged US-ASCII, which has only one-byte characters.
=end
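The three cases in the comment above (ASCII-only source, non-ASCII source, and the `/u` flag) can be put side by side in one script. This is a sketch under assumptions, not a statement of how the parser is implemented: the helper names `regexp_tag` and `match_utf8` are invented here, and π is written as the escape `\u03C0` so that the script does not depend on how the file itself is read.

```ruby
# encoding: UTF-8
# Sketch only: how the three kinds of regexp from the comment above are tagged.
# `regexp_tag` and `match_utf8` are hypothetical helpers for illustration.
def regexp_tag(re)
  [re.encoding.name, re.fixed_encoding?]
end

def match_utf8(re, text)
  index = (re =~ text)
  [index, $&]
end

PLAIN  = /my ./      # ASCII-only source, no flag
WIDE   = /\u03C0/    # contains a non-ASCII character (π, written as an escape)
FORCED = /my ./u     # ASCII-only source, forced with /u

p regexp_tag(PLAIN)    # ["US-ASCII", false]
p regexp_tag(WIDE)     # ["UTF-8", true]
p regexp_tag(FORCED)   # ["UTF-8", true]

# The US-ASCII regexp still matches a UTF-8 string, and '.' consumes the
# whole multibyte character rather than a single byte.
index, matched = match_utf8(PLAIN, "my \u03C0")
p index            # 0
p matched.length   # 4 characters: "m", "y", " ", "π"
```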
Updated by usa (Usaku NAKAMURA) over 14 years ago
- Status changed from Rejected to Feedback
=begin
IMHO, the US-ASCII fallback of regexp literals is a spec bug of ruby.
The encoding of regexp literals should be the same as script encoding, like string literals.
But, as Eric said, US-ASCII regexp can match with UTF-8 string.
So, this bug may not cause any troubles, I guess.
If this has caused a concrete problem, please give an example.
If not, please turn this ticket into a feature request, because changing this behavior
may cause compatibility problems.
=end
Updated by headius (Charles Nutter) over 14 years ago
=begin
On Wed, Feb 9, 2011 at 6:47 PM, Usaku NAKAMURA redmine@ruby-lang.org wrote:
> IMHO, the US-ASCII fallback of regexp literals is a spec bug of ruby.
> The encoding of regexp literals should be the same as script encoding, like string literals.
I found it unusual until we started digging into regexp parsing logic
for JRuby. As far as I can tell, the change to US-ASCII is a very
explicit decision, to allow the widest-possible functionality for a
regexp that only needs to match 7-bit ASCII text. Limiting it to only
the encoding specified for the file would mean ASCII-only regexps
could not (necessarily) be used to match a variety of other encodings,
even though logically they'd match just fine.
- Charlie
=end
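The trade-off Charlie describes can be tried directly by running one ASCII-only regexp against strings in several encodings, then repeating with the same pattern forced to UTF-8. This is a rough sketch: the sample strings, the choice of ISO-8859-1 and Shift_JIS, and the helper `try_match` are assumptions made for illustration and do not come from the ticket or from JRuby.

```ruby
# encoding: UTF-8
# Sketch only: one ASCII-only pattern tried against several encodings.
# SAMPLES and `try_match` are hypothetical, chosen for illustration.
SAMPLES = [
  "foo \u03C0",                        # UTF-8 (π)
  "foo \u00E9".encode("ISO-8859-1"),   # Latin-1 (é)
  "foo \u3042".encode("Shift_JIS")     # Shift_JIS (hiragana a)
]

ASCII_ONLY  = /foo.*$/    # tagged US-ASCII, not fixed
FORCED_UTF8 = /foo.*$/u   # same pattern, pinned to UTF-8

def try_match(re, str)
  (re =~ str) ? "match at #{$~.begin(0)}" : "no match"
rescue Encoding::CompatibilityError
  # Raised when a fixed-encoding regexp meets a non-ASCII string
  # in a different encoding.
  "incompatible"
end

SAMPLES.each do |str|
  puts "#{str.encoding.name}: plain=#{try_match(ASCII_ONLY, str)} " \
       "forced=#{try_match(FORCED_UTF8, str)}"
end
```

The unflagged regexp should match all three strings, while the `/u` version should only be usable with the UTF-8 one. That is the flexibility that would be lost if ASCII-only literals were always pinned to the script encoding.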
Updated by meta (mathew murphy) over 14 years ago
=begin
On Wed, Feb 9, 2011 at 18:47, Usaku NAKAMURA redmine@ruby-lang.org wrote:
> But, as Eric said, US-ASCII regexp can match with UTF-8 string.
> So, this bug may not cause any troubles, I guess.
So long as ASCII regexps will still match UTF-8 strings as if they
were UTF-8 regexps -- i.e. regexps are silently coerced back to UTF-8
when necessary -- I don't see a problem.
I was just surprised that UTF-8 regexps were sometimes silently
converted to ASCII, and that regexps declared with // were ASCII by
default even if I had declared that I wanted UTF-8. I wondered if
there was a reason for doing all that extra processing that I was
missing.
mathew
URL:http://www.pobox.com/~meta/
=end
Updated by naruse (Yui NARUSE) over 14 years ago
- Status changed from Feedback to Rejected