Message 378225 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	Dennis Sweeney
Recipients	Dennis Sweeney, Zeturic, ammar2, josh.r, pmpp, serhiy.storchaka, tim.peters
Date	2020-10-08.09:09:01
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1602148141.7.0.959963365329.issue41972@roundup.psfhosted.org>
In-reply-to

Content
Adding some hasty printf-debugging to fastsearch.h, I see this: >>> with open('data.bin', 'rb') as f: ... s = f.read() ... >>> base = 15403807 * b'\xff' >>> longer = base + b'\xff' >>> base in s Bloom has 0x00: 0 Bloom has 0xff: -2147483648 skip = 0 Checking bloom filter for next char 0x00... not found in pattern. Skipping a bunch. Candidate at 15403808 True >>> longer in s Bloom has 0x00: No Bloom has 0xff: Yes skip = 0 Checking bloom filter for next char 0xff... found in pattern; increment by only one. Candidate at 1 Candidate at 2 Candidate at 3 Candidate at 4 Candidate at 5 Candidate at 6 Candidate at 7 (...) Trying the same base, I also get this: >>> with open('data.bin', 'rb') as f: ... s = f.read() ... >>> base = 15403807 * b'\xff' >>> s1 = s[1:] >>> base in s1 Bloom has 0x00: No Bloom has 0xff: Yes skip = 0 Checking bloom filter for next char 0xff... found in pattern; increment by only one. Candidate at 1 Candidate at 2 Candidate at 3 Candidate at 4 Candidate at 5 Candidate at 6 Candidate at 7 (...) So maybe part of the issue is just that the algorithm is getting particularly unlucky regarding how the strings line up to start with. The fact that "skip" is 0 in these cases seems to just be a limitation of the algorithm: we only skip to line up the second-to-last occurrence of the last character, but with these strings, that's just the second-to-last character. Computing the entire Boyer-Moore table would be better in cases like these, but that would require dynamic memory allocation (length of pattern) which I think is why it is avoided.

Adding some hasty printf-debugging to fastsearch.h, I see this: >>> with open('data.bin', 'rb') as f: ... s = f.read() ... >>> base = 15403807 * b'\xff' >>> longer = base + b'\xff' >>> base in s Bloom has 0x00: 0 Bloom has 0xff: -2147483648 skip = 0 Checking bloom filter for next char 0x00... not found in pattern. Skipping a bunch. Candidate at 15403808 True >>> longer in s Bloom has 0x00: No Bloom has 0xff: Yes skip = 0 Checking bloom filter for next char 0xff... found in pattern; increment by only one. Candidate at 1 Candidate at 2 Candidate at 3 Candidate at 4 Candidate at 5 Candidate at 6 Candidate at 7 (...) Trying the same base, I also get this: >>> with open('data.bin', 'rb') as f: ... s = f.read() ... >>> base = 15403807 * b'\xff' >>> s1 = s[1:] >>> base in s1 Bloom has 0x00: No Bloom has 0xff: Yes skip = 0 Checking bloom filter for next char 0xff... found in pattern; increment by only one. Candidate at 1 Candidate at 2 Candidate at 3 Candidate at 4 Candidate at 5 Candidate at 6 Candidate at 7 (...) So maybe part of the issue is just that the algorithm is getting particularly unlucky regarding how the strings line up to start with. The fact that "skip" is 0 in these cases seems to just be a limitation of the algorithm: we only skip to line up the second-to-last occurrence of the last character, but with these strings, that's just the second-to-last character. Computing the entire Boyer-Moore table would be better in cases like these, but that would require dynamic memory allocation (length of pattern) which I think is why it is avoided.

History
Date	User	Action	Args
2020-10-08 09:09:01	Dennis Sweeney	set	recipients: + Dennis Sweeney, tim.peters, pmpp, serhiy.storchaka, josh.r, ammar2, Zeturic
2020-10-08 09:09:01	Dennis Sweeney	set	messageid: <1602148141.7.0.959963365329.issue41972@roundup.psfhosted.org>
2020-10-08 09:09:01	Dennis Sweeney	link	issue41972 messages
2020-10-08 09:09:01	Dennis Sweeney	create