Bug #4386

closed

encoding: directive does not affect regex expressions

Added by meta (mathew murphy) over 14 years ago. Updated over 14 years ago.

Status:
Rejected
Assignee:
-
Target version:
-
ruby -v:
ruby 1.9.2p0 (2010-08-18 revision 29036) [i686-linux]
Backport:
[ruby-core:35171]

Description

=begin
$ cat foo.rb
#!/usr/local/bin/ruby
# encoding: UTF-8

puts /foo.*$/.encoding
$ ruby foo.rb
US-ASCII
$
=end

Updated by nobu (Nobuyoshi Nakada) over 14 years ago Actions #1

  • Status changed from Open to Rejected

=begin
It does.

$ ruby
# encoding: EUC-JP

puts /\xa1\xa1/.encoding
EUC-JP

$ ruby
# encoding: cp932

puts /\x81\xa4/.encoding
Windows-31J

=end

Updated by nobu (Nobuyoshi Nakada) over 14 years ago Actions #2

=begin
US-ASCII only regexps are set to US-ASCII encoding, same as US-ASCII only strings.
It's intentional.
=end
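
For anyone who wants to check the rule described above, here is a minimal sketch. It uses a \u escape instead of a raw non-ASCII character so the result does not depend on how the file is saved or on the locale; the variable names are only illustrative.

```ruby
# Sketch: which encoding a regexp literal ends up with.

ascii_only = /foo.*$/   # source contains only 7-bit ASCII characters
non_ascii  = /\u03c0/   # contains U+03C0 (GREEK SMALL LETTER PI)

puts ascii_only.encoding         # US-ASCII
puts ascii_only.fixed_encoding?  # false: may be matched against other ASCII-compatible encodings
puts non_ascii.encoding          # UTF-8
puts non_ascii.fixed_encoding?   # true: tied to UTF-8
```

Regexp#fixed_encoding? is the easiest way to tell the two cases apart at runtime.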

Updated by luislavena (Luis Lavena) over 14 years ago Actions #3

=begin
Nobu,

I don't see what you're showing:

C:\Users\Luis\Desktop>ruby -v
ruby 1.9.2p136 (2010-12-25) [i386-mingw32]

C:\Users\Luis\Desktop>ruby
# encoding: UTF-8

puts /.*/.encoding
^Z
US-ASCII

C:\Users\Luis\Desktop>type foo.rb
# encoding: UTF-8

puts /.*/.encoding

C:\Users\Luis\Desktop>ruby foo.rb
US-ASCII

We are having problems with this in Cucumber on Windows: a regexp, even in a file with a UTF-8 magic comment, is being captured as US-ASCII and does not match the output of backticks.

Sorry, but I don't understand why Regexps are not following the magic comment as you showed.

=end

Updated by dlh (Daniel Harple) over 14 years ago Actions #4

=begin

> US-ASCII only regexps are set to US-ASCII encoding, same as US-ASCII only strings. It's intentional.

In this case Regexp is not consistent with String's behavior:

$ cat test.rb
# encoding: utf-8

r = /a/
s = "a"
p [r, r.encoding]
p [s, s.encoding]
$ ruby -v test.rb
ruby 1.9.2p136 (2010-12-25 revision 30365) [i386-darwin9.8.0]
[/a/, #<Encoding:US-ASCII>]
["a", #<Encoding:UTF-8>]

=end

Updated by drbrain (Eric Hodel) over 14 years ago Actions #5

=begin
This regexp has US-ASCII-only characters:

puts /.*/.encoding

This regexp has UTF-8 characters, so it is in UTF-8 encoding:

$ ruby -ve '# coding: UTF-8' -e 'p /π/.encoding'
ruby 1.9.2p136 (2010-12-25 revision 30365) [x86_64-darwin10.6.0]
-e:2: warning: ambiguous first argument; put parentheses or even spaces
#<Encoding:UTF-8>

Of course, you can't mix:

$ ~/.multiruby/install/1.9.2-p136/bin/ruby -ve '# coding: US-ASCII' -e 'p /π/.encoding'
ruby 1.9.2p136 (2010-12-25 revision 30365) [x86_64-darwin10.6.0]
-e:2: warning: ambiguous first argument; put parentheses or even spaces
-e:2: invalid multibyte char (US-ASCII)
-e:2: invalid multibyte char (US-ASCII)

But you can force:

$ ~/.multiruby/install/1.9.2-p136/bin/ruby -ve '# coding: UTF-8' -e 'p /.*/u.encoding'
ruby 1.9.2p136 (2010-12-25 revision 30365) [x86_64-darwin10.6.0]
-e:2: warning: ambiguous first argument; put parentheses or even spaces
#<Encoding:UTF-8>

See also ri Regexp in the Encoding section:

A regexp can be matched against a string when they either share an encoding,
or the regexp's encoding is US-ASCII and the string's encoding is
ASCII-compatible.

A US-ASCII regexp matches a UTF-8 string correctly:

$ ~/.multiruby/install/1.9.2-p136/bin/ruby -ve '# coding: UTF-8' -e 'r = /my ./; p r.encoding, r =~ "my π", $&'
ruby 1.9.2p136 (2010-12-25 revision 30365) [x86_64-darwin10.6.0]
#<Encoding:US-ASCII>
0
"my π"

'.' matches a UTF-8 character even though the regexp is in US-ASCII which has only one-byte characters.

=end
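
The same checks as a small script, in case it helps. This is only a sketch: the string uses a \u escape so it is UTF-8 no matter how the file is saved, and the variable names are illustrative.

```ruby
# Sketch: a US-ASCII regexp matching a UTF-8 string, and forcing UTF-8 with /u.

re  = /my ./          # ASCII-only literal, so its encoding is US-ASCII
str = "my \u03c0"     # UTF-8 string ending in U+03C0 (pi)

index   = (re =~ str)
matched = $~[0]

puts re.encoding        # US-ASCII
puts index              # 0
puts matched.length     # 4 ('.' consumed the whole multibyte character)
puts matched.encoding   # UTF-8 (the match uses the string's encoding)

forced = /my ./u        # the u flag pins the regexp to UTF-8
puts forced.encoding    # UTF-8
```

The match is carried out in the string's encoding, which is why '.' consumes the whole two-byte character rather than a single byte.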

Updated by usa (Usaku NAKAMURA) over 14 years ago Actions #6

  • Status changed from Rejected to Feedback

=begin
IMHO, the US-ASCII fallback of regexp literals is a spec bug of ruby.
The encoding of regexp literals should be the same as script encoding, like string literals.

But, as Eric said, US-ASCII regexp can match with UTF-8 string.
So, this bug may not cause any troubles, I guess.

If this has caused a concrete problem, please give an example.
If not, please turn this ticket into a feature request, because changing this behavior
may cause compatibility problems.

=end

Updated by headius (Charles Nutter) over 14 years ago Actions #7

=begin
On Wed, Feb 9, 2011 at 6:47 PM, Usaku NAKAMURA wrote:

> IMHO, the US-ASCII fallback of regexp literals is a spec bug of ruby.
> The encoding of regexp literals should be the same as script encoding, like string literals.

I found it unusual until we started digging into regexp parsing logic
for JRuby. As far as I can tell, the change to US-ASCII is a very
explicit decision, to allow the widest-possible functionality for a
regexp that only needs to match 7-bit ASCII text. Limiting it to only
the encoding specified for the file would mean ASCII-only regexps
could not (necessarily) be used to match a variety of other encodings,
even though logically they'd match just fine.

- Charlie

=end
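
A rough illustration of the point above, as a sketch under my own assumptions (the sample text and the choice of EUC-JP and Windows-31J are arbitrary): one ASCII-only regexp can be applied to strings in several ASCII-compatible encodings, while a regexp pinned to UTF-8 with /u cannot be applied to a non-ASCII string in another encoding.

```ruby
# Sketch: an ASCII-only regexp works across encodings; a fixed-encoding one does not.

ascii_re = /my ./     # US-ASCII, not fixed
fixed_re = /my ./u    # fixed to UTF-8

utf8 = "my \u3042"                    # UTF-8, ends in HIRAGANA LETTER A
euc  = utf8.encode("EUC-JP")
sjis = utf8.encode("Windows-31J")

# The same regexp matches all three; each match keeps its string's encoding.
matched_encodings = [utf8, euc, sjis].map { |s| ascii_re.match(s)[0].encoding.name }
puts matched_encodings.inspect   # ["UTF-8", "EUC-JP", "Windows-31J"]

# The UTF-8-only regexp refuses a non-ASCII EUC-JP string.
fixed_error =
  begin
    fixed_re =~ euc
    nil
  rescue Encoding::CompatibilityError => e
    e.class
  end
puts fixed_error.inspect         # Encoding::CompatibilityError
```

If regexp literals always took the script encoding, the first half of this sketch would behave like the second.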

Updated by meta (mathew murphy) over 14 years ago Actions #8

=begin
On Wed, Feb 9, 2011 at 18:47, Usaku NAKAMURA wrote:

> But, as Eric said, US-ASCII regexp can match with UTF-8 string.
> So, this bug may not cause any troubles, I guess.

So long as ASCII regexps will still match UTF-8 strings as if they
were UTF-8 regexps -- i.e. regexps are silently coerced back to UTF-8
when necessary -- I don't see a problem.

I was just surprised that UTF-8 regexps were sometimes silently
converted to ASCII, and that regexps declared with // were ASCII by
default even if I had declared that I wanted UTF-8. I wondered if
there was a reason for doing all that extra processing that I was
missing.

mathew

URL:http://www.pobox.com/~meta/

=end

Updated by naruse (Yui NARUSE) over 14 years ago Actions #9

  • Status changed from Feedback to Rejected