@sbuerk
Contributor

@sbuerk sbuerk commented Apr 2, 2024

[BUGFIX] Respect language based style names on reading Word files

Microsoft Office saves documents with language-specific style
mappings for the default styles. For example, a German version
of Word writes the following to word/styles.xml in the container
archive (*.docx):

<w:style w:type="paragraph" w:styleId="berschrift1">
  <w:name w:val="heading 1"/>
  ....
</w:style>

whereas an English version of Word writes:

<w:style w:type="paragraph" w:styleId="Heading1">
  <w:name w:val="heading 1"/>
  ...
</w:style>

The value of <w:name /> is the internal, language-independent
identifier, whereas the w:styleId attribute on the outer
<w:style /> tag is the language-specific alias.
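To illustrate the relationship between the two values, here is a minimal Python sketch (not PHPWord code) that reads a styles fragment and maps each w:styleId to its <w:name /> value. The fragment combines the German and English entries from above into one document purely for demonstration, and the function name read_style_names is hypothetical.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/styles.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Illustrative fragment only: both language variants side by side.
STYLES_XML = f"""<w:styles xmlns:w="{W_NS}">
  <w:style w:type="paragraph" w:styleId="berschrift1">
    <w:name w:val="heading 1"/>
  </w:style>
  <w:style w:type="paragraph" w:styleId="Heading1">
    <w:name w:val="heading 1"/>
  </w:style>
</w:styles>"""


def read_style_names(xml_text):
    """Return {styleId: native name} for every <w:style> element."""
    root = ET.fromstring(xml_text)
    result = {}
    for style in root.iter(f"{{{W_NS}}}style"):
        style_id = style.get(f"{{{W_NS}}}styleId")
        name = style.find(f"{{{W_NS}}}name")
        if style_id is not None and name is not None:
            result[style_id] = name.get(f"{{{W_NS}}}val")
    return result


style_names = read_style_names(STYLES_XML)
```

Both aliases resolve to the same native name, which is what makes the name a usable key for language-independent matching.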

Later parts of the document structure, for example the
paragraphs, reference a style by its alias (w:styleId). The
reader code matches these references with hardcoded,
case-insensitive RegEx patterns based on the English variant
(Header\s+d), which do not match the language-specific alias
at all.
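The mismatch can be reproduced in a few lines of Python. The pattern below is a hypothetical stand-in for an English-only, case-insensitive heading expression and is not the reader's actual RegEx; heading_depth is likewise an illustrative name.

```python
import re

# Hypothetical English-only pattern, matched case-insensitively.
HEADING_RE = re.compile(r"heading\s*(\d)", re.IGNORECASE)


def heading_depth(style_id):
    """Return the heading level encoded in a style id, or None if it does not match."""
    match = HEADING_RE.search(style_id)
    return int(match.group(1)) if match else None


english = heading_depth("Heading1")     # alias written by an English Word
german = heading_depth("berschrift1")   # alias written by a German Word
```

Here the English alias yields a depth, while the German alias yields None, so the paragraph would not be recognised as a title.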

Therefore, this change contains several tasks:

  • An alias map is implemented and used to register title
    aliases, along with a corresponding lookup method.
  • The lookup method is used to resolve aliases wherever
    the hardcoded English RegEx is applied.
  • All style alias names are gathered for all possible
    styles while reading the Word file's style settings.
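The tasks listed above could be sketched roughly as follows. This is an illustration in Python under stated assumptions, not the PHP code of this change: the class StyleAliases, its methods register and resolve, and the heading pattern are all hypothetical names chosen for the example.

```python
import re

# Hypothetical English-only heading pattern (assumption, see lead-in).
HEADING_RE = re.compile(r"heading\s*(\d)", re.IGNORECASE)


class StyleAliases:
    """Maps language-specific style ids to their native style names."""

    def __init__(self):
        self._aliases = {}

    def register(self, style_id, native_name):
        # Store keys lower-cased so lookups are case-insensitive.
        self._aliases[style_id.lower()] = native_name

    def resolve(self, style_id):
        # Fall back to the id itself when no alias was registered.
        return self._aliases.get(style_id.lower(), style_id)


def heading_depth(style_id, aliases):
    """Resolve the alias first, then apply the English pattern."""
    match = HEADING_RE.search(aliases.resolve(style_id))
    return int(match.group(1)) if match else None


aliases = StyleAliases()
# Gathered while reading the styles part of the document.
aliases.register("berschrift1", "heading 1")

german_depth = heading_depth("berschrift1", aliases)
english_depth = heading_depth("Heading1", aliases)
```

Falling back to the unresolved id keeps documents without registered aliases working as before; only ids that would otherwise miss the pattern take the extra lookup path.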
@coveralls

coveralls commented Apr 2, 2024

Coverage Status

coverage: 97.171% (-0.05%) from 97.217%
when pulling 13a5d65 on sbuerk:stefan-1
into 8b891bb on PHPOffice:master.

@Progi1984
Member

@sbuerk Have you got a sample file with 🇩🇪 styles ?

@Progi1984 added the "Status: Waiting for feedback" label (Question has been asked, waiting for response from PR author) on Aug 15, 2024
@sbuerk
Contributor Author

sbuerk commented Oct 11, 2024

@sbuerk Have you got a sample file with 🇩🇪 styles ?

Sorry, I was kind of busy maintaining other open source stuff, and will be until Thursday. I will try to update this and my other PRs in the next 2 weeks. Sorry for the delay.
