Bug #15210: UTF-8 BOM should be removed from String in internal representation - Ruby - Ruby Issue Tracking System

Updated by shevegen (Robert A. Heiler) about 7 years ago
#1 [ruby-core:89299]

BTW: stdlib::CSV chokes on the BOM

I can't say how common this is or whether it is really a bug. But if it is,
and the faulty behaviour affects other ruby hackers, I agree that CSV
should probably be able to handle a BOM as well, in one way or another
(be it automatically or via another API).

I also agree that it could perhaps be mentioned somewhere, be it in the
csv documentation or elsewhere.

To the workaround: I assume you meant this only as a solution for others who face
a similar problem, rather than as a permanent addition to class String, yes?
(I ask this because adding a specific method to class String permanently in
ruby may be much harder to do and get approved, whereas an extension to ruby's
CSV is most likely easier and possible.)

Updated by nobu (Nobuyoshi Nakada) about 7 years ago
#2 [ruby-core:89300]

Description updated (diff)
Assignee set to 13939

foonlyboy (Eike Dierks) wrote:

I believe this to be a bug in how byte data is converted to the ruby internal String representation.

Yes, a BOM should be removed at the conversion, the reading from a data stream.

There is a workaround, but this needs to be documented:
IO.read(path, mode: 'r:BOM|UTF-8')

It is documented at IO.new, and you can use it at CSV.open too.

rdoc of CSV.open:

You must pass a filename and may optionally add a mode for Ruby's open().

rdoc of Kernel.open:

See the documentation of IO.new for full documentation of the mode string directives.

rdoc of IO.new:

If "BOM|UTF-8", "BOM|UTF-16LE" or "BOM|UTF-16BE" are used, Ruby checks for
a Unicode BOM in the input document to help determine the encoding. For
UTF-16 encodings the file open mode must be binary. When present, the BOM
is stripped and the external encoding from the BOM is used. When the BOM
is missing the given Unicode encoding is used as ext_enc. (The BOM-set
encoding option is case insensitive, so "bom|utf-8" is also valid.)
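As an illustration of the mode string described above, here is a minimal sketch that writes a small file starting with a UTF-8 BOM and reads it back with and without "BOM|UTF-8". The helper names (with_bom_file, read_plain, read_bom_aware, csv_headers) and the sample data are made up for this example; only File.read, CSV.open and Tempfile.create are real library calls.

```ruby
require "csv"
require "tempfile"

# Hypothetical helper: writes +content+ preceded by the UTF-8 BOM bytes
# (EF BB BF) to a temporary file and yields its path.
def with_bom_file(content)
  Tempfile.create(["bom_demo", ".csv"]) do |f|
    f.binmode
    f.write("\xEF\xBB\xBF".b + content.b)
    f.flush
    yield f.path
  end
end

# Plain UTF-8 read: the BOM stays in the String as a leading U+FEFF.
def read_plain(path)
  File.read(path, mode: "r:UTF-8")
end

# BOM-aware read: a leading BOM is stripped while reading.
def read_bom_aware(path)
  File.read(path, mode: "r:BOM|UTF-8")
end

# The same mode string passed through CSV.open.
def csv_headers(path)
  CSV.open(path, "r:BOM|UTF-8", headers: true) { |csv| csv.first.headers }
end

with_bom_file("name,age\nalice,30\n") do |path|
  p read_plain(path).start_with?("\uFEFF")
  p read_bom_aware(path).start_with?("\uFEFF")
  p csv_headers(path)
end
```

With the plain read, the first CSV header would carry the invisible U+FEFF prefix, which is the kind of failure reported in this issue.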

Documentation improvement patches are welcome.

But I'm asking to improve the UTF BOM handling:

The BOM is only used for transfer encoding at the byte stream level.

This is half true.

https://en.wikipedia.org/wiki/Byte_order_mark#Usage

If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space"

The character at any other position is not called a "BOM".

The BOM MUST NOT be part of the String in internal representation.

Yes, it should be removed at read time; that is the only chance to remove a BOM properly.

Updated by foonlyboy (Eike Dierks) about 7 years ago
#3 [ruby-core:89391]

I looked into it a bit more closely:

io.c does this in

static int io_strip_bom(VALUE io)

which is called by:

static void io_set_encoding_by_bom(VALUE io)

It is documented at IO.new, and you can use it at CSV.open too.
Yes, I was aware of this.

I also agree that the conversion has to take place when opening the file.

But with rails I get an ActionDispatch::Http::UploadedFile
(which returns an ASCII-8BIT byte stream)

And I could find no way to apply the io_strip_bom() to it,
not even by going through StringIO.
(but then Ruby is not about applying tricks anyway)
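For data that has already arrived as an ASCII-8BIT String, one possible stopgap is to do the stripping by hand. This is only a sketch: utf8_without_bom and UTF8_BOM are hypothetical names, not part of Ruby or Rails, and it assumes the bytes really are UTF-8.

```ruby
# U+FEFF, which is encoded as the bytes EF BB BF in UTF-8.
UTF8_BOM = "\uFEFF"

# Hypothetical helper: reinterprets binary data as UTF-8 and removes a
# single leading BOM. A U+FEFF anywhere else in the text is left alone.
def utf8_without_bom(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  text.delete_prefix(UTF8_BOM)
end

raw = "\xEF\xBB\xBFname,age\n".b   # stands in for the uploaded bytes
p raw.encoding
p utf8_without_bom(raw)
```

Only the first character is touched, so a U+FEFF in the middle of the data keeps its "zero-width non-breaking space" meaning.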

It sounds to me as if nobu also agrees that the BOM should always be removed.

If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space"

I don't care so much about this for now.
(while I can imagine this to happen when concatenating files ...)

But let's fix the more simple problems first.

I think the BOM is used for two reasons in byte streams:

a magic number for UTF encoded data (which might even apply to UTF-8)
a magic number to distinguish different UTF byte orderings when using UTF-16, UTF-32, UTF-36?
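To make the "magic number" idea concrete, here is a rough sketch of BOM sniffing on raw bytes. The table and the detect_bom helper are illustrative only and are not how io.c implements it; the byte sequences themselves are the standard BOMs for UTF-8, UTF-16 and UTF-32.

```ruby
# Known BOM byte sequences. The 4-byte UTF-32 marks come first because
# the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
BOMS = {
  "\x00\x00\xFE\xFF".b => Encoding::UTF_32BE,
  "\xFF\xFE\x00\x00".b => Encoding::UTF_32LE,
  "\xEF\xBB\xBF".b     => Encoding::UTF_8,
  "\xFE\xFF".b         => Encoding::UTF_16BE,
  "\xFF\xFE".b         => Encoding::UTF_16LE,
}.freeze

# Hypothetical helper: returns the Encoding announced by a leading BOM,
# or nil when the data does not start with one.
def detect_bom(bytes)
  data = bytes.b
  BOMS.each { |bom, enc| return enc if data.start_with?(bom) }
  nil
end

p detect_bom("\xEF\xBB\xBFabc")
p detect_bom("\xFF\xFEa\x00")
p detect_bom("abc")
```

For UTF-8 the mark carries no byte-order information and only acts as a signature, which matches the first of the two uses listed above.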

But in the ruby world, we have String.
We should remove all artefacts of any external encoding.

Impact:

I believe this might need changes in more than one place in the code,
but I believe it should be fully upward compatible with most customers' code.

This should still agree with the ruby spec,
because nowhere was it ever declared that String keeps the BOM.

Please excuse my lengthy writings,
but I thought these encoding problems were a thing from the past.

We might also look at the other languages around.
Makes for a good rosetta code ...

~eike

Updated by nobu (Nobuyoshi Nakada) over 6 years ago
#4 [ruby-core:93095]

https://github.com/nobu/ruby/pull/new/feature/15210-detect_bom

Updated by nobu (Nobuyoshi Nakada) over 6 years ago
#5 [ruby-core:93098]

Renamed, and added an exception for unexpected conditions.
https://github.com/nobu/ruby/pull/new/feature/15210-set_encoding_by_bom

Updated by nobu (Nobuyoshi Nakada) over 6 years ago
#6

Status changed from Open to Closed

Applied in changeset git|e717d6faa8463c70407e6aaf116c6b6181f30be6.

IO#set_encoding_by_bom

io.c (rb_io_set_encoding_by_bom): IO#set_encoding_by_bom to set
the encoding by BOM if exists. [Bug #15210]
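A small usage sketch of the new method, assuming a Ruby version that ships IO#set_encoding_by_bom (2.7 or later). The read_with_bom_detection helper and the file contents are made up for the example; the stream is opened in binary mode because the method expects that.

```ruby
require "tempfile"

# Hypothetical helper: opens +path+ in binary mode, lets
# IO#set_encoding_by_bom look for a BOM, and returns the detected
# encoding (nil if there is no BOM) together with the remaining contents.
def read_with_bom_detection(path)
  File.open(path, "rb") do |io|
    enc = io.set_encoding_by_bom
    [enc, io.read]
  end
end

Tempfile.create("bom_demo") do |f|
  f.binmode
  f.write("\xEF\xBB\xBFabc".b)
  f.flush
  enc, text = read_with_bom_detection(f.path)
  p enc
  p text
end
```

When a BOM is found it is consumed from the stream, so the String returned by read no longer starts with U+FEFF.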

Updated by nobu (Nobuyoshi Nakada) over 6 years ago
#7

Related to Bug #15908: Detecting BOM with non-UTF encoding added
