- Notifications
You must be signed in to change notification settings - Fork 85
Improve using //
in XPath performance #249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
5bf858e
to 119ed65
Compare I have some concerns about maintainability because cache invalidation bug is hard to debug. Here are some ideas of cache strategy.
def node_test namespaces_cache = {} # Cache scope is enclosed in this method. # There is no need to worry about cache invalidation. call node.namespaces(namespaces_cache) to use the cache end def namespaces(cache = nil) if cache cache[self] ||= calculate_namespaces(cache) else calculate_namespaces end end
def namespaces parent_namespaces = parent&.namespaces # If parent.namespaces or @attributes is changed, cache is automatically invalidated return @namespaces if parent_namespaces.equal?(@parent_namespaces) && @attrubutes.frozen? @attributes.freeze # Need to unfreeze before modifying @attributes @parent_namespaces = parent_namespaces @namespaces = @parent_namespaces.merge(@attributes).freeze end |
119ed65
to 34cca4c
Compare
For maintainability, I have changed my mind about adopting policy 1 above. |
I have a bit concern about adding a I feel that it may a bit ad-hoc approach. I'm not object this approach but can we improve our API? For example, how about keeping a cache per document? class XPathParser def parse path, nodeset path_stack = @parser.parse( path ) nodeset.first.document.enable_cache do # New document level cache is created and available in this block. # This API is thread unsafe. Users can't change this document in this block. match( path_stack, nodeset ) end end end |
34cca4c
to cea9bb8
Compare I add |
cea9bb8
to ffe6f9a
Compare I have corrected the points you pointed out. |
ffe6f9a
to 24cfc1b
Compare I have corrected the points you pointed out. |
ff07142
to acf7ecf
Compare I have corrected the points you pointed out. |
#252) `XPath.match`, `XPath.first`, `XPath.each`, `XPathParser#parse` and `XPathParser#match` accepted nodeset as element. This pull request changes the first parameter of these method to be an element instead of nodeset. Passing nodeset will be deprecated. ```ruby # Documented usage. OK REXML::XPath.match(element, xpath) # Undocumented usage. Deprecate in this pull request nodeset = [element] REXML::XPath.match(nodeset, xpath) ``` ### Background #249 will introduce a temporary cache. ```ruby def parse path, nodeset path_stack = @parser.parse( path ) nodeset.first.document.send(:enable_cache) do match( path_stack, nodeset ) end end ``` But the signature `XPathParser#match(path, nodeset)` does not guarantee that all nodes in the nodeset has the same root document. So cache does not work in the code below. It's still slow. ```ruby REXML::XPath.match(2.times.map { REXML::Document.new('<a>'*400+'</a>'*400) }, 'a//a') ``` The interface is holding our back, so I propose to drop accepting array as element. This change is a backward incompatibility, but it just drops undocumented feature. I think only the test code was unintentionally using this feature. ### XPath.match with array XPath.match only traverse the first element of the array for some selectors. ```ruby nodeset = [REXML::Document.new("<a><b/></a>"), REXML::Document.new("<a><c/></a>")] REXML::XPath.match(nodeset, "a/*") #=> [<b/>, <c/>] REXML::XPath.match(nodeset, "//a/*") #=> [<b/>] # I expect [<b/>, <c/>] but the second document is ignored ``` It indicates that `XPath.match` is not designed to search inside multiple nodes/documents. --------- Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
acf7ecf
to 7279481
Compare #252 merged, so I rebased this PR and updated. |
7279481
to 88ec36f
Compare I have corrected the points you pointed out. |
88ec36f
to 1f67fbd
Compare I have corrected the points you pointed out. |
1f67fbd
to b28407e
Compare I have corrected the points you pointed out. |
## Why? for fix get_namespace performance > # FIXME: This DOUBLES the time XPath searches take Co-authored-by: tomoya ishida <tomoyapenguin@gmail.com> Co-authored-by: Sutou Kouhei <kou@clear-code.com>
## Benchmark (Comparison with rexml 3.4.1) ``` $ benchmark-driver benchmark/xpath.yaml Calculating ------------------------------------- rexml 3.4.1 master 3.4.1(YJIT) master(YJIT) REXML::XPath.match(REXML::Document.new(xml), 'a//a') 29.215 234.909 108.945 492.410 i/s - 100.000 times in 3.422925s 0.425697s 0.917898s 0.203083s Comparison: REXML::XPath.match(REXML::Document.new(xml), 'a//a') master(YJIT): 492.4 i/s master: 234.9 i/s - 2.10x slower 3.4.1(YJIT): 108.9 i/s - 4.52x slower rexml 3.4.1: 29.2 i/s - 16.85x slower ``` - YJIT=ON : 4.52x faster - YJIT=OFF : 8.04x faster
b28407e
to 6c573e0
Compare I have corrected the points you pointed out. |
Thanks. |
When using
//
in XPath, the deeper the tag hierarchy, the slower it becomes due to the namespace acquisition process.Caching namespace information improves performance when using
//
with XPath.Benchmark (Comparison with rexml 3.4.1)