Performance enhancements. Rebuild the word parser and replace the whitespace checker in the match finder. #102

SavageTiger · 2021-04-03T00:25:36Z

Description

While investigating whether I could improve performance for #101, I stumbled upon two bottlenecks.

1. The html to word parser was a big for loop that walked over every character individually

The code was really complex and hard to comprehend. It was a huge loop with a lot of control flow that depended on which character was being processed, which character had been processed previously, and what was coming up next.

I replaced it with mostly regex-based parsing that does the heavy lifting, speeding this method up by 98%.

Old
New
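To illustrate the idea of letting a regular expression do the splitting instead of a per-character loop, here is a minimal Python sketch. It is not the code from this PR: the pattern, the `TOKEN_PATTERN` name, and the `split_into_words` function are all hypothetical, and the real parser handles more cases.

```python
import re

# Hypothetical pattern (an assumption, not the one used in this PR).
# Alternatives are tried left to right, so more specific tokens come first.
TOKEN_PATTERN = re.compile(
    r"<[^>]+>"    # a complete HTML tag, e.g. <p> or </p>
    r"|&#?\w+;"   # an HTML entity, e.g. &amp; or &#160;
    r"|\s+"       # a run of whitespace
    r"|\w+"       # a word
    r"|.",        # any other single character (punctuation, stray '<', ...)
    re.DOTALL,
)


def split_into_words(html: str) -> list:
    """Split an HTML string into tags, entities, whitespace runs and words."""
    # findall returns the full match for each token because the pattern
    # contains no capturing groups.
    return TOKEN_PATTERN.findall(html)


tokens = split_into_words("<p>Hello, world &amp; all</p>")
```

Because the last alternative matches any single character, every character of the input ends up in exactly one token, so joining the tokens reproduces the original string.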

2. Whitespace checking was resource intensive

While finding blocks we have to do a whitespace check on part of the old sentence, and this happens many times. It was done by referencing the oldWords array, temporarily building a string from part of that array, and then checking whether that string contained only whitespace.

A couple of years ago I already added caching here, which sped up the algorithm a lot, but I took some more time to investigate whether this could be improved further.

I have replaced this with a loop that iterates over that part of the sentence item by item. As soon as one of the items is not whitespace (usually the first item), it immediately reports false and caches the result, speeding up the algorithm by up to 50% in some cases.

Old
New
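The early-exit check with caching described above can be sketched roughly as follows. This is an illustrative Python version, not the PR's actual code: the `WhitespaceChecker` class, its method name, and the `(start, length)` cache key are assumptions made for the example.

```python
class WhitespaceChecker:
    """Checks whether a slice of the old words consists only of whitespace."""

    def __init__(self, old_words):
        self._old_words = old_words
        self._cache = {}  # hypothetical cache keyed by (start, length)

    def is_only_whitespace(self, start: int, length: int) -> bool:
        key = (start, length)
        if key in self._cache:
            return self._cache[key]

        result = True
        end = min(start + length, len(self._old_words))
        # Walk the items one by one instead of joining them into a string,
        # and stop at the first item that contains a non-whitespace character.
        for index in range(start, end):
            if self._old_words[index].strip() != "":
                result = False
                break

        self._cache[key] = result
        return result


checker = WhitespaceChecker(["<p>", " ", "\n", "word", " "])
only_spaces = checker.is_only_whitespace(1, 2)
starts_with_tag = checker.is_only_whitespace(0, 3)
```

Since the first item in a slice is usually not whitespace, the loop tends to exit after a single comparison, whereas joining the slice into a string first pays for the whole slice every time.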

Added typehints, strict type checking, and removed else code-flow

While finding blocks we have to do a whitespace check on part of the old sentence, and this happens many times. It was done by referencing the oldWords array, temporarily building a string from part of that array, and then checking whether that string contained only whitespace. I have replaced this with a loop that iterates over that part of the sentence item by item. As soon as one of the items is not whitespace (usually the first item), it immediately reports false and caches the result, speeding up the algorithm by up to 50% in some cases.

The old parser split the HTML into one big string of characters and then went over that string character by character, using a big, complicated switch statement and loads of if statements. Most of this has been replaced by regular expression parsing, greatly simplifying the code and reducing the execution time by 98%.

SavageTiger mentioned this pull request Apr 3, 2021

Some diffs take far too long even with no multi-byte #101

Closed

SavageTiger force-pushed the sven/feature/html_diff_update branch 2 times, most recently from a81629e to 317fdba Compare April 5, 2021 15:10

Sven Hagemann added 4 commits April 5, 2021 22:23

Cleanup of unused local properties

89a3a61

Refactored indexNewWords

0a10375

Added typehints, strict type checking, and removed else code-flow

Cleanup: Fix indenting in operations switch

ad70b4f

SavageTiger force-pushed the sven/feature/html_diff_update branch from 59073d6 to 7ef0cd5 Compare April 5, 2021 20:35

pull-request-size bot added the size/L label Apr 5, 2021

SavageTiger force-pushed the sven/feature/html_diff_update branch 2 times, most recently from 6cc072f to 7ef0cd5 Compare April 5, 2021 20:57

SavageTiger force-pushed the sven/feature/html_diff_update branch from 7ef0cd5 to a16330a Compare April 5, 2021 20:58

SavageTiger changed the title ~~WIP: Rebuilding the word parser for better performance~~ Performance enhancements. Rebuild the word parser and replace the whitespace checker in the match finder. Apr 5, 2021

SavageTiger merged commit 08e8a6d into master Apr 5, 2021

SavageTiger deleted the sven/feature/html_diff_update branch April 9, 2021 08:54
