Issue 7139: ElementTree: Incorrect serialization of end-of-line characters in attribute values


Created on 2009-10-15 06:21 by moriyoshi, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (7)
msg94074 - (view)	Author: Moriyoshi Koizumi (moriyoshi)	Date: 2009-10-15 06:21
ElementTree doesn't correctly serialize end-of-line characters (#xa, #xd) in attribute values. Since bare end-of-line characters are converted to #x20 by the parser according to the specification [1], such characters that are represented as character references in the original document must be serialized in the same form.

[1] http://www.w3.org/TR/xml11/#AVNormalize

### sample code

from xml.etree.ElementTree import ElementTree
from cStringIO import StringIO

# builder = ElementTree(file=StringIO("<foo>\x0d</foo>"))
# out = StringIO()
# builder.write(out)
# print out.getvalue()

out = StringIO()
ElementTree(file=StringIO(
'''<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE foo [
<!ELEMENT foo (#PCDATA)>
<!ATTLIST foo attr CDATA "">
]>
<foo attr=" test test test a ">
</foo>
''')).write(out)
# should be "<foo attr=" test test test a ">\x0a</foo>
print out.getvalue()

out = StringIO()
ElementTree(file=StringIO(
'''<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE foo [
<!ELEMENT foo (#PCDATA)>
<!ATTLIST foo attr NMTOKENS "">
]>
<foo attr=" test test test a ">
</foo>
''')).write(out)
# should be "<foo attr="test test test a">\x0a</foo>
print out.getvalue()
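As a rough illustration of the normalization rule cited in [1], here is a small Python 3 sketch (the sample above targets Python 2). The two one-element documents are made up for this example, and the behavior shown assumes the expat-based parser that xml.etree.ElementTree uses by default in CPython.

```python
import xml.etree.ElementTree as ET

# Literal line feed and tab inside the attribute value: the parser
# normalizes each of them to a single space (#x20).
literal = ET.fromstring('<foo attr="a\nb\tc"/>')

# The same characters written as character references: the parser
# hands them back as real line feed and tab characters.
referenced = ET.fromstring('<foo attr="a&#10;b&#9;c"/>')

print(repr(literal.get("attr")))
print(repr(referenced.get("attr")))
```

Once parsed, the two values differ, so a serializer that writes the second one back with bare whitespace would produce a document that reads like the first.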
msg94077 - (view)	Author: Moriyoshi Koizumi (moriyoshi)	Date: 2009-10-15 07:39
Tabs must be converted to character references as well.
msg94833 - (view)	Author: Moriyoshi Koizumi (moriyoshi)	Date: 2009-11-02 16:12
Looks like a duplicate of #6492
msg94853 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2009-11-02 22:06
If I understood correctly, the correct behavior while reading is:
* literal newlines (\n or \r) and tabs (\t) should be collapsed and converted to a space
* newlines (&#10; or &#13;) and tabs (&#9;) as entities should be converted to the literal equivalents (\n, \r and \t)
(See http://www.w3.org/TR/2000/WD-xml-c14n-20000119.html#charescaping)
This should be ok in both xml.dom.minidom and etree.

Instead, while writing, if literal newlines and tabs are written as they are (\n, \r and \t), they can't be read back during the parsing phase because they are collapsed and converted to a space. They should therefore be converted to entities (&#10;, &#13; and &#9;) automatically, but this could be incompatible with the current behavior (i.e. \n, \r or \t that are now written and then collapsed to a space during parsing will become significant).

Moriyoshi, can you confirm that what I said is correct and the problem is similar to the one described in #5752?
I also closed #6492 as duplicate of this.
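The write-side behavior proposed above could be sketched roughly as follows in Python 3. The helper name escape_attrib_whitespace and the sample value are hypothetical, invented for this illustration; this is not ElementTree's or minidom's actual code, only one way the idea might look.

```python
import xml.etree.ElementTree as ET


def escape_attrib_whitespace(value):
    """Hypothetical helper: escape a value for a double-quoted attribute.

    Not taken from ElementTree; it only illustrates the proposal.
    """
    # Escape markup-significant characters first, '&' before the others
    # so that the references added below are not escaped again.
    value = value.replace("&", "&amp;")
    value = value.replace("<", "&lt;")
    value = value.replace('"', "&quot;")
    # Write whitespace that would not survive attribute-value
    # normalization as character references.
    value = value.replace("\r", "&#13;")
    value = value.replace("\n", "&#10;")
    value = value.replace("\t", "&#9;")
    return value


original = 'line1\r\nline2\tend & "quoted"'
document = '<foo attr="%s"/>' % escape_attrib_whitespace(original)
restored = ET.fromstring(document).get("attr")

print(document)
print(restored == original)
```

The round trip through ET.fromstring is there to check that the escaped value is read back unchanged; without the three whitespace replacements the parser would return spaces instead.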
msg94855 - (view)	Author: Fredrik Lundh (effbot) *	Date: 2009-11-02 22:27
The real problem here is that XML attributes weren't really designed to hold data that doesn't survive normalization. One would have thought that making it difficult to do that, and easy to store such things as character data, would have made people think a bit before designing XML formats that do things the other way around, but apparently some people find it hard having to use their brain when designing things...

FWIW, the current ET 1.3 beta escapes newlines but not tabs and carriage returns; I don't really mind adding tabs, but I'm less sure about carriage returns -- XML pretty much treats CR as a junk character also outside attributes, and escaping it in all contexts would just be silly.
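Which of these characters a given ElementTree release escapes can be checked empirically. The Python 3 sketch below uses a hypothetical helper, attribute_round_trips, that serializes an element and parses it back; the results for tab and carriage return depend on the installed Python version, so no expected output is given.

```python
import xml.etree.ElementTree as ET


def attribute_round_trips(value):
    """Serialize an element carrying `value`, parse it back, and compare."""
    element = ET.Element("foo", {"attr": value})
    serialized = ET.tostring(element, encoding="unicode")
    return ET.fromstring(serialized).get("attr") == value


# Report, for the running interpreter, which whitespace characters
# survive a write/read cycle inside an attribute value.
for name, char in [("newline", "\n"), ("tab", "\t"), ("carriage return", "\r")]:
    print(name, attribute_round_trips("a" + char + "b"))
```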
msg95145 - (view)	Author: Moriyoshi Koizumi (moriyoshi)	Date: 2009-11-11 17:38
@ezio.melotti
Yes, as for parsing, it works flawlessly. Fixing this would actually break the current behavior, but I believe this is how it should work. It seems #5752 pretty much says the same thing.

@effbot
As specified in 2.11 End-of-Line Handling [2], all variants of EOL characters should have been normalized into a single #xa before the document actually gets parsed, so bare #xd characters would never appear as they are among the parsed information items.

[2] http://www.w3.org/TR/xml/#sec-line-ends
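The end-of-line handling from [2] can be observed with a short Python 3 sketch. The sample documents are invented for illustration, and the behavior assumes the default expat-based parser in CPython.

```python
import xml.etree.ElementTree as ET

# Literal CR LF and a lone CR in character data are both normalized
# to a single line feed (#xa) before the application sees them.
normalized = ET.fromstring("<foo>a\r\nb\rc</foo>").text

# A carriage return written as a character reference is not a line end
# in the input, so it reaches the application as a real #xd.
referenced = ET.fromstring("<foo>a&#13;b</foo>").text

print(repr(normalized))
print(repr(referenced))
```

This is why a #xd that appears in parsed data can only have come from a character reference, and why writing it back as a bare character would lose it on the next parse.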
msg111540 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-07-25 13:00
Closed as a duplicate of #5752 which has patches attached.

History
Date	User	Action	Args
2022-04-11 14:56:54	admin	set	github: 51388
2010-07-25 13:00:54	BreamoreBoy	set	status: open -> closed versions: + Python 3.1, Python 3.2, - Python 2.6 nosy: + BreamoreBoy messages: + msg111540 resolution: duplicate
2009-11-11 17:38:12	moriyoshi	set	messages: + msg95145
2009-11-02 22:27:29	effbot	set	messages: + msg94855
2009-11-02 22:06:35	ezio.melotti	set	nosy: + ezio.melotti, devon messages: + msg94853 versions: + Python 2.7
2009-11-02 16:12:47	moriyoshi	set	messages: + msg94833
2009-10-15 07:39:04	moriyoshi	set	messages: + msg94077
2009-10-15 06:28:06	ezio.melotti	set	priority: normal assignee: effbot nosy: + effbot
2009-10-15 06:21:42	moriyoshi	set	title: Incorrect serialization of end-of-line characters in attribute values -> ElementTree: Incorrect serialization of end-of-line characters in attribute values
2009-10-15 06:21:29	moriyoshi	create