generate small/fast table for changes_when_{uppercased,lowercased} #4
This is a rough PR with a finished design/impl, but the code is just a mess[^1], and I wasn't really going to PR this, but here it is. The output, combined all into one playground, can be seen here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=3568421c224d239f574e1eec7a964381. It has a test which shows it answers both queries correctly for every char.
The basic notes are:
I could probably get it faster than this, but it's already overengineered enough.
The overall approach is to store a list of ranges in a table, and binary search that. The ranges might indicate:

- chars that only change under lowercase (but not uppercase), e.g. `U+00c0..=U+00d6` (or vice versa)
- chars that alternate `upper,lower,upper,lower,...` (or similarly `lower,upper,lower,upper,...`). This is very common in Unicode, and special-casing it is why there are only 200ish ranges (and not over 1000, most of them for a single character).

Runs with lengths that don't fit into 8 bits are split into multiple contiguous smaller ones, and then each range is encoded as a `u32`, as `MSB [21 bit start_char | 3 bit range type | 8 bit length] LSB`. This seems likely to work indefinitely, since 21 bits can fit any `char`, and we don't use all the values for the 3-bit range type. That said, it probably won't have to change.
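For concreteness, here's a minimal sketch of that packing, assuming the layout above. The `RangeType` variants and every name here are hypothetical stand-ins, not the PR's actual identifiers, and the variant meanings are my guess at what the range types could be:

```rust
/// Hypothetical range types; the real table's 3-bit values may differ.
#[derive(Clone, Copy)]
#[repr(u32)]
enum RangeType {
    /// Every char in the range changes when lowercased (but not uppercased),
    /// like `U+00c0..=U+00d6`.
    LowerOnly = 0,
    /// Every char in the range changes when uppercased (but not lowercased).
    UpperOnly = 1,
    /// Alternating run: upper, lower, upper, lower, ...
    AltStartsUpper = 2,
    /// Alternating run: lower, upper, lower, upper, ...
    AltStartsLower = 3,
}

/// Pack as MSB [21-bit start_char | 3-bit range type | 8-bit length] LSB.
fn encode(start: char, ty: RangeType, len: u8) -> u32 {
    // `start as u32` is at most 0x10FFFF, which fits in 21 bits.
    ((start as u32) << 11) | ((ty as u32) << 8) | (len as u32)
}

/// Unpack into (start_char, range_type, length).
fn decode(raw: u32) -> (u32, u32, u32) {
    (raw >> 11, (raw >> 8) & 0b111, raw & 0xff)
}
```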
The generator uses a greedy-ish algorithm to categorize every character into a range. It then filters out ASCII and "no changes" ranges, splits them up (with minimal cleanup), and encodes each range into a `u32`.
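To make the lookup side concrete too, here's a hedged sketch of answering one of the queries by binary searching the encoded table. It assumes entries are sorted by start char, that a range covers `start..start + len`, and that the range-type values match my guesses in the encoding sketch above; none of this is taken from the PR's actual code:

```rust
/// Whether `c` changes when lowercased, per a table of encoded ranges.
fn changes_when_lowercased(c: char, table: &[u32]) -> bool {
    let cp = c as u32;
    // Entries sort by start char (the high 21 bits), so find the last
    // entry whose start is <= cp.
    let idx = table.partition_point(|&e| (e >> 11) <= cp);
    if idx == 0 {
        return false; // cp is before the first range
    }
    let raw = table[idx - 1];
    let (start, ty, len) = (raw >> 11, (raw >> 8) & 0b111, raw & 0xff);
    if cp >= start + len {
        return false; // cp falls in a gap between ranges
    }
    match ty {
        0 => true,                  // LowerOnly: whole range changes
        1 => false,                 // UpperOnly: unchanged under lowercase
        2 => (cp - start) % 2 == 0, // alternating run starting uppercase
        3 => (cp - start) % 2 == 1, // alternating run starting lowercase
        _ => false,                 // unused type values
    }
}
```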
I started this a while ago, then forgot, then came back and banged it all out this weekend. It's a problem space near/dear to my heart tho -- I've spent an unreasonable amount of time on making Unicode tables better.
Re: #3 (comment)
Well, the size impact of this on the generated binary is very small, and it runs fast too. But it's rather high complexity in the table generator, so the size impact in this repo is high. I don't mind owning/maintaining that, and I'm happy to get the code more cleaned up if you want, but I wouldn't be offended if you don't.
FWIW, it only relies on Unicode's stability promises (like the number of bits in a codepoint), and handles the cases that could change (e.g. 8 length bits no longer being enough). That means in theory future Unicode updates should not come with any drama.
My general feeling is that with enough elbow grease you can get any of the Unicode tables small. It's a compression (and data access) problem, just not a very well-studied one for whatever reason. I had a scratch workspace that could produce all the tables regex needed with ~40KB of data. It would have required a lot of changes to actually use, so I put it in my own regex engine, which never saw the light of day. So it goes.
Still, it's not ideal that everything has its own tables, but I don't really see an alternative. I don't want to load them from the system, and stuff like icu4x feels like the wrong choice for code that cares about footprint.
[^1]: Like, it's a huge mess aside from proving the approach (the code has tons of debugging stuff and duplication, etc). I'd clean it up a lot if you were interested.