Skip to content

Conversation

naitoh
Copy link
Contributor

@naitoh naitoh commented Feb 17, 2024

Why?

Improve maintainability by optimizing the process so that the parsing process proceeds using StringScanner#scan.

Changed

  • Change REXML::Parsers::BaseParser from frozen_string_literal: false to frozen_string_literal: true.
  • Added Source#string= method for error message output.
  • Added TestParseDocumentTypeDeclaration#test_no_name test case.
  • Of the intSubset of DOCTYPE, "<!" added consideration for processing Comments that begin with "<!".

[Benchmark]

RUBYLIB= BUNDLER_ORIG_RUBYLIB= /Users/naitoh/.rbenv/versions/3.3.0/bin/ruby -v -S benchmark-driver /Users/naitoh/ghq/github.com/naitoh/rexml/benchmark/parse.yaml ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin22] Calculating ------------------------------------- before after before(YJIT) after(YJIT) dom 11.240 10.569 17.173 18.219 i/s - 100.000 times in 8.896882s 9.461267s 5.823007s 5.488884s sax 31.812 30.716 48.383 52.532 i/s - 100.000 times in 3.143500s 3.255655s 2.066861s 1.903600s pull 36.855 36.354 56.718 61.443 i/s - 100.000 times in 2.713300s 2.750693s 1.763099s 1.627523s stream 34.176 34.758 49.801 54.622 i/s - 100.000 times in 2.925991s 2.877065s 2.008003s 1.830779s Comparison: dom after(YJIT): 18.2 i/s before(YJIT): 17.2 i/s - 1.06x slower before: 11.2 i/s - 1.62x slower after: 10.6 i/s - 1.72x slower sax after(YJIT): 52.5 i/s before(YJIT): 48.4 i/s - 1.09x slower before: 31.8 i/s - 1.65x slower after: 30.7 i/s - 1.71x slower pull after(YJIT): 61.4 i/s before(YJIT): 56.7 i/s - 1.08x slower before: 36.9 i/s - 1.67x slower after: 36.4 i/s - 1.69x slower stream after(YJIT): 54.6 i/s before(YJIT): 49.8 i/s - 1.10x slower after: 34.8 i/s - 1.57x slower before: 34.2 i/s - 1.60x slower 
  • YJIT=ON : 1.06x - 1.10x faster
  • YJIT=OFF : 0.94x - 1.01x faster
@naitoh naitoh marked this pull request as draft February 18, 2024 21:56
@naitoh naitoh force-pushed the use_string_scanner_if_document_status_is_nil_in_pull_event branch 3 times, most recently from c9a3bf2 to 8a24e85 Compare February 23, 2024 10:34
@naitoh naitoh changed the title Changed processing of @document_status == nil in REXML::Parsers::BaseParser#pull_event from regular expression to processing using StringScanner. Changed processing in REXML::Parsers::BaseParser#pull_event from regular expression to processing using StringScanner. Feb 23, 2024
@naitoh
Copy link
Contributor Author

naitoh commented Feb 23, 2024

@kou

Could you use this style?

I changed to that style and modified the whole REXML::Parsers::BaseParser#pull_event to use StringScanner.

We don't need to use existing regular expression constants such as DOCTYPE_START for it.
Because match unit will be different for the current approach and StringScanner based approach.

$ cat benchmark/stringscan_1.yaml loop_count: 100000 contexts: - name: No YJIT prelude: | $LOAD_PATH.unshift(File.expand_path("lib")) require 'rexml' prelude: | require 'strscan' s = StringScanner.new('abcdefg hijklmn opqrstu vwxyz') ext = '' ptn_ext = /[a-z]*#{ext}/ ptn_no_ext = /[a-z]*/ benchmark: 'check(/[a-z]*/)' : s.check(/[a-z]*/) 'ptn_no_ext=/[a-z]*/; s.check(ptn_no_ext)' : | ptn_no_ext=/[a-z]*/ s.check(ptn_no_ext) 'check(ptn_no_ext)' : s.check(ptn_no_ext) 'check(/[a-z]*#{ext}/)' : s.check(/[a-z]*#{ext}/) 'ptn_ext=/[a-z]*#{ext}/; s.check(ptn_ext)' : | ptn_ext=/[a-z]*#{ext}/ s.check(ptn_ext) 'check(ptn_ext)' : s.check(ptn_ext) 
$ benchmark-driver benchmark/stringscan_1.yaml Calculating ------------------------------------- check(/[a-z]*/) 8.215M i/s - 100.000k times in 0.012173s (121.73ns/i) ptn_no_ext=/[a-z]*/; s.check(ptn_no_ext) 8.033M i/s - 100.000k times in 0.012448s (124.48ns/i) check(ptn_no_ext) 8.183M i/s - 100.000k times in 0.012221s (122.21ns/i) check(/[a-z]*#{ext}/) 948.623k i/s - 100.000k times in 0.105416s (1.05μs/i) ptn_ext=/[a-z]*#{ext}/; s.check(ptn_ext) 958.653k i/s - 100.000k times in 0.104313s (1.04μs/i) check(ptn_ext) 8.180M i/s - 100.000k times in 0.012225s (122.25ns/i) Comparison: check(/[a-z]*/): 8214901.3 i/s check(ptn_no_ext): 8182636.3 i/s - 1.00x slower check(ptn_ext): 8179959.0 i/s - 1.00x slower ptn_no_ext=/[a-z]*/; s.check(ptn_no_ext): 8033419.2 i/s - 1.02x slower ptn_ext=/[a-z]*#{ext}/; s.check(ptn_ext): 958653.3 i/s - 8.57x slower check(/[a-z]*#{ext}/): 948622.6 i/s - 8.66x slower 

The format check(/[a-z]*#{ext}/) is too slow.

For this reason, I do not use @source.match( /((?>#{QNAME_STR}))/um, true), but @source.match(TAG_MATCH, true).

$ cat benchmark/stringscan_2.yaml loop_count: 100000 contexts: - name: No YJIT prelude: | $LOAD_PATH.unshift(File.expand_path("lib")) require 'rexml' prelude: | require 'strscan' s = StringScanner.new('abcdefg hijklmn opqrstu vwxyz') ext = '' ptn_ext = "a#{ext}" ptn_no_ext = "a" benchmark: 'check("a")' : s.check("a") 'ptn_no_ext="a";s.check(ptn_no_ext)' : | ptn_no_ext="a" s.check(ptn_no_ext) 'check(ptn_no_ext)' : s.check(ptn_no_ext) 'check("a#{ext}")' : s.check("a#{ext}") 'ptn_ext="a#{ext}"; s.check(ptn_ext)' : | ptn_ext="a#{ext}" s.check(ptn_ext) 'check(ptn_ext)' : s.check(ptn_ext) 
$ benchmark-driver benchmark/stringscan_2.yaml Calculating ------------------------------------- check("a") 10.321M i/s - 100.000k times in 0.009689s (96.89ns/i) ptn_no_ext="a";s.check(ptn_no_ext) 9.878M i/s - 100.000k times in 0.010123s (101.23ns/i) check(ptn_no_ext) 13.441M i/s - 100.000k times in 0.007440s (74.40ns/i) check("a#{ext}") 7.657M i/s - 100.000k times in 0.013060s (130.60ns/i) ptn_ext="a#{ext}"; s.check(ptn_ext) 7.522M i/s - 100.000k times in 0.013295s (132.95ns/i) check(ptn_ext) 13.472M i/s - 100.000k times in 0.007423s (74.23ns/i) Comparison: check(ptn_ext): 13471642.3 i/s check(ptn_no_ext): 13440859.7 i/s - 1.00x slower check("a"): 10320982.6 i/s - 1.31x slower ptn_no_ext="a";s.check(ptn_no_ext): 9878495.0 i/s - 1.36x slower check("a#{ext}"): 7656967.4 i/s - 1.76x slower ptn_ext="a#{ext}"; s.check(ptn_ext): 7521624.7 i/s - 1.79x slower 

check("a") is slower than check(ptn_no_ext).
For this reason I use the form @source.match(TAG_START, true, false).

@naitoh naitoh marked this pull request as ready for review February 23, 2024 11:34
@naitoh naitoh requested a review from kou February 23, 2024 11:34
@kou
Copy link
Member

kou commented Feb 23, 2024

Comparison: check(/[a-z]*/): 8214901.3 i/s check(ptn_no_ext): 8182636.3 i/s - 1.00x slower check(ptn_ext): 8179959.0 i/s - 1.00x slower ptn_no_ext=/[a-z]*/; s.check(ptn_no_ext): 8033419.2 i/s - 1.02x slower ptn_ext=/[a-z]*#{ext}/; s.check(ptn_ext): 958653.3 i/s - 8.57x slower check(/[a-z]*#{ext}/): 948622.6 i/s - 8.66x slower 

I think that Regexp creation is the reason. ptn_ext=/[a-z]*#{ext}/; s.check(ptn_ext) and check(/[a-z]*#{ext}/) create a new Regexp because it has string (regular expression?) interpolation. We can avoid it by /.../o (o option).

@kou
Copy link
Member

kou commented Feb 23, 2024

Comparison: check(ptn_ext): 13471642.3 i/s check(ptn_no_ext): 13440859.7 i/s - 1.00x slower check("a"): 10320982.6 i/s - 1.31x slower ptn_no_ext="a";s.check(ptn_no_ext): 9878495.0 i/s - 1.36x slower check("a#{ext}"): 7656967.4 i/s - 1.76x slower ptn_ext="a#{ext}"; s.check(ptn_ext): 7521624.7 i/s - 1.79x slower 

I think that String creation is the reason. Items except check(ptn_ext) and check(ptn_no_ext) creates a new String in each execution.

@kou
Copy link
Member

kou commented Feb 23, 2024

(I'll review implementation later...)

@naitoh
Copy link
Contributor Author

naitoh commented Feb 24, 2024

Comparison: check(/[a-z]*/): 8214901.3 i/s check(ptn_no_ext): 8182636.3 i/s - 1.00x slower check(ptn_ext): 8179959.0 i/s - 1.00x slower ptn_no_ext=/[a-z]*/; s.check(ptn_no_ext): 8033419.2 i/s - 1.02x slower ptn_ext=/[a-z]*#{ext}/; s.check(ptn_ext): 958653.3 i/s - 8.57x slower check(/[a-z]*#{ext}/): 948622.6 i/s - 8.66x slower 

I think that Regexp creation is the reason. ptn_ext=/[a-z]*#{ext}/; s.check(ptn_ext) and check(/[a-z]*#{ext}/) create a new Regexp because it has string (regular expression?) interpolation. We can avoid it by /.../o (o option).

Comparison: check(ptn_no_ext): 8259002.0 i/s check(ptn_ext): 8211528.9 i/s - 1.01x slower check(/[a-z]*/): 8206138.1 i/s - 1.01x slower check(/[a-z]*#{ext}/o): 8198737.4 i/s - 1.01x slower ptn_no_ext=/[a-z]*/; s.check(ptn_no_ext): 8000640.1 i/s - 1.03x slower ptn_ext=/[a-z]*#{ext}/o; s.check(ptn_ext): 7968762.5 i/s - 1.04x slower 

I found a workaround with /... /o.
Thanks!

@naitoh
Copy link
Contributor Author

naitoh commented Feb 24, 2024

Comparison: check(ptn_ext): 13471642.3 i/s check(ptn_no_ext): 13440859.7 i/s - 1.00x slower check("a"): 10320982.6 i/s - 1.31x slower ptn_no_ext="a";s.check(ptn_no_ext): 9878495.0 i/s - 1.36x slower check("a#{ext}"): 7656967.4 i/s - 1.76x slower ptn_ext="a#{ext}"; s.check(ptn_ext): 7521624.7 i/s - 1.79x slower 

I think that String creation is the reason. Items except check(ptn_ext) and check(ptn_no_ext) creates a new String in each execution.

Comparison: check(ptn_ext): 13609145.2 i/s check("a".freeze): 13592498.0 i/s - 1.00x slower check(ptn_no_ext): 13486176.9 i/s - 1.01x slower check("a"): 10145074.2 i/s - 1.34x slower ptn_no_ext="a";s.check(ptn_no_ext): 10034116.0 i/s - 1.36x slower check("a#{ext}"): 7734550.3 i/s - 1.76x slower ptn_ext="a#{ext}"; s.check(ptn_ext): 7601094.5 i/s - 1.79x slower 

I found a workaround with "".freeze.
Thanks!

@naitoh naitoh force-pushed the use_string_scanner_if_document_status_is_nil_in_pull_event branch from 8a24e85 to a7c9491 Compare February 24, 2024 06:02
## Why? Because `s.check("a")` is slower than `s.check("a".freeze)`. - benchmark/stringscan_2.yaml ``` loop_count: 100000 contexts: - name: No YJIT prelude: | $LOAD_PATH.unshift(File.expand_path("lib")) require 'rexml' prelude: | require 'strscan' s = StringScanner.new('abcdefg hijklmn opqrstu vwxyz') ptn = "a" benchmark: 'check("a")' : s.check("a") 'check("a".freeze)' : s.check("a".freeze) 'ptn="a";s.check(ptn)' : | ptn="a" s.check(ptn) 'check(ptn)' : s.check(ptn) ``` ``` $benchmark-driver benchmark/stringscan_2.yaml Comparison: check(ptn): 13524479.4 i/s check("a".freeze): 13433638.1 i/s - 1.01x slower check("a"): 10231225.8 i/s - 1.32x slower ptn="a";s.check(ptn): 10013017.0 i/s - 1.35x slower ```
@naitoh naitoh force-pushed the use_string_scanner_if_document_status_is_nil_in_pull_event branch from a7c9491 to 9a075bf Compare February 24, 2024 06:27
@naitoh
Copy link
Contributor Author

naitoh commented Feb 24, 2024

  • "".freeze was handled with frozen_string_literal: true.
  • For reasons unknown, /... /o is degrading performance in a YJIT environment, so I would like it to remain as it is.
--- a/lib/rexml/parsers/baseparser.rb (9a075bf) +++ b/lib/rexml/parsers/baseparser.rb (after) @@ -373,7 +373,7 @@ module REXML return process_instruction else # Get the next tag - md = @source.match(TAG_MATCH, true) + md = @source.match(/((?>#{QNAME_STR}))/umo, true) unless md @source.string = "<" + @source.buffer raise REXML::ParseException.new("malformed XML: missing tag start", @source)
RUBYLIB= BUNDLER_ORIG_RUBYLIB= /Users/naitoh/.rbenv/versions/3.3.0/bin/ruby -v -S benchmark-driver /Users/naitoh/ghq/github.com/naitoh/rexml/benchmark/parse.yaml ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin22] Calculating ------------------------------------- 9a075bf after 9a075bf(YJIT) after(YJIT) dom 11.391 11.456 17.511 17.488 i/s - 100.000 times in 8.778836s 8.728827s 5.710787s 5.718334s sax 31.562 31.574 50.602 45.577 i/s - 100.000 times in 3.168363s 3.167168s 1.976219s 2.194067s pull 37.249 37.164 63.066 57.618 i/s - 100.000 times in 2.684616s 2.690755s 1.585646s 1.735582s stream 35.675 35.734 55.543 50.465 i/s - 100.000 times in 2.803056s 2.798452s 1.800413s 1.981582s Comparison: dom 9a075bf(YJIT): 17.5 i/s after(YJIT): 17.5 i/s - 1.00x slower after: 11.5 i/s - 1.53x slower 9a075bf: 11.4 i/s - 1.54x slower sax 9a075bf(YJIT): 50.6 i/s after(YJIT): 45.6 i/s - 1.11x slower after: 31.6 i/s - 1.60x slower 9a075bf: 31.6 i/s - 1.60x slower pull 9a075bf(YJIT): 63.1 i/s after(YJIT): 57.6 i/s - 1.09x slower 9a075bf: 37.2 i/s - 1.69x slower after: 37.2 i/s - 1.70x slower stream 9a075bf(YJIT): 55.5 i/s after(YJIT): 50.5 i/s - 1.10x slower after: 35.7 i/s - 1.55x slower 9a075bf: 35.7 i/s - 1.56x slower 
@naitoh
Copy link
Contributor Author

naitoh commented Feb 25, 2024

In 8d7fc13, updated handling of document_status == :in_doctype to StringScanner style.

@naitoh naitoh force-pushed the use_string_scanner_if_document_status_is_nil_in_pull_event branch from 8d7fc13 to bc05d26 Compare February 25, 2024 23:34
@naitoh
Copy link
Contributor Author

naitoh commented Feb 25, 2024

Fixed what was pointed out.
And optimizations were made.

@naitoh naitoh requested a review from kou February 25, 2024 23:44
…lar expression to processing using StringScanner. ## Why Improve maintainability by optimizing the process so that the parsing process proceeds using StringScanner#scan. # Changed - Added Source#string= method for error message output. - Added TestParseDocumentTypeDeclaration#test_no_name test case. - Of the `intSubset` of DOCTYPE, "<!" added consideration for processing `Comments` that begin with "<!". [intSubset Spec] https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-doctypedecl > [28]	doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>' https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-intSubset > [28b] intSubset ::= (markupdecl | DeclSep)* https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-markupdecl > [29] markupdecl ::= elementdecl | AttlistDecl | EntityDecl | NotationDecl | PI | Comment https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-elementdecl > [45] elementdecl ::= '<!ELEMENT' S Name S contentspec S? '>' https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-AttlistDecl > [52]	AttlistDecl ::= '<!ATTLIST' S Name AttDef* S? '>' https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-EntityDecl > [70]	EntityDecl ::= GEDecl | PEDecl > [71]	GEDecl ::= '<!ENTITY' S Name S EntityDef S? '>' > [72]	PEDecl ::= '<!ENTITY' S '%' S Name S PEDef S? '>' https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-NotationDecl > [82]	NotationDecl ::= '<!NOTATION' S Name S (ExternalID | PublicID) S? '>' https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PI > [16]	PI ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>' https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-Comment > [15]	Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->' https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-DeclSep > [28a] DeclSep ::= PEReference | S https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PEReference > [69] PEReference ::= '%' Name ';' [Benchmark] ``` RUBYLIB= BUNDLER_ORIG_RUBYLIB= /Users/naitoh/.rbenv/versions/3.3.0/bin/ruby -v -S benchmark-driver /Users/naitoh/ghq/github.com/naitoh/rexml/benchmark/parse.yaml ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin22] Calculating ------------------------------------- before after before(YJIT) after(YJIT) dom 11.240 10.569 17.173 18.219 i/s - 100.000 times in 8.896882s 9.461267s 5.823007s 5.488884s sax 31.812 30.716 48.383 52.532 i/s - 100.000 times in 3.143500s 3.255655s 2.066861s 1.903600s pull 36.855 36.354 56.718 61.443 i/s - 100.000 times in 2.713300s 2.750693s 1.763099s 1.627523s stream 34.176 34.758 49.801 54.622 i/s - 100.000 times in 2.925991s 2.877065s 2.008003s 1.830779s Comparison: dom after(YJIT): 18.2 i/s before(YJIT): 17.2 i/s - 1.06x slower before: 11.2 i/s - 1.62x slower after: 10.6 i/s - 1.72x slower sax after(YJIT): 52.5 i/s before(YJIT): 48.4 i/s - 1.09x slower before: 31.8 i/s - 1.65x slower after: 30.7 i/s - 1.71x slower pull after(YJIT): 61.4 i/s before(YJIT): 56.7 i/s - 1.08x slower before: 36.9 i/s - 1.67x slower after: 36.4 i/s - 1.69x slower stream after(YJIT): 54.6 i/s before(YJIT): 49.8 i/s - 1.10x slower after: 34.8 i/s - 1.57x slower before: 34.2 i/s - 1.60x slower ``` - YJIT=ON : 1.06x - 1.10x faster - YJIT=OFF : 0.94x - 1.01x faster Co-authored-by: Sutou Kouhei <kou@clear-code.com>
@naitoh naitoh force-pushed the use_string_scanner_if_document_status_is_nil_in_pull_event branch from bc05d26 to 54b0298 Compare February 26, 2024 14:33
@naitoh
Copy link
Contributor Author

naitoh commented Feb 26, 2024

Fixed what was pointed out.

@naitoh naitoh requested a review from kou February 26, 2024 14:35
@kou kou changed the title Changed processing in REXML::Parsers::BaseParser#pull_event from regular expression to processing using StringScanner. Use more StringScanner based API to parse XML Feb 27, 2024
@kou kou merged commit 370666e into ruby:master Feb 27, 2024
@kou
Copy link
Member

kou commented Feb 27, 2024

Thanks!

@naitoh naitoh deleted the use_string_scanner_if_document_status_is_nil_in_pull_event branch February 27, 2024 01:28
naitoh added a commit to naitoh/rexml that referenced this pull request Mar 2, 2024
## Why? ruby#114 (comment) > I want to just change scan pointer (StringScanner#pos=) instead of changing @scanner.string.
naitoh added a commit to naitoh/rexml that referenced this pull request Mar 3, 2024
## Why? ruby#114 (comment) > I want to just change scan pointer (StringScanner#pos=) instead of changing @scanner.string.
naitoh added a commit to naitoh/rexml that referenced this pull request Mar 3, 2024
## Why? We want to just change scan pointer. ruby#114 (comment) > I want to just change scan pointer (StringScanner#pos=) instead of changing @scanner.string.
kou pushed a commit that referenced this pull request Mar 3, 2024
## Why? We want to just change scan pointer. #114 (comment) > I want to just change scan pointer (`StringScanner#pos=`) instead of changing `@scanner.string`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants