
Conversation

@Bodigrim
Contributor

Results

Benchmark results for GHC 8.10 can be found at https://gist.github.com/Bodigrim/365e388e080b17de45e80ab50a55fb4f. I'll publish a detailed analysis later, but here are the most notable improvements:

  • decodeUtf8 is up to 10x faster for non-ASCII texts.
  • encodeUtf8 is ~1.5-4x faster for strict and up to 10x faster for lazy Text.
  • take / drop / length are up to 20x faster.
  • toUpper / toLower are 10-30% faster.
  • Eq instance of strict Text is 10x faster.
  • Ord instance is typically 30%+ faster.
  • isInfixOf and search routines are up to 10x faster.
  • replicate of Char is up to 20x faster.

The geometric mean of benchmark time ratios (utf8 vs. master, lower is better) is 0.33.
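
A minimal sketch of how such a figure is computed from per-benchmark ratios r_i = t_utf8 / t_master (illustrative Haskell, not part of the branch):

-- Geometric mean of time ratios; 0.33 means the utf8 branch takes,
-- on average, a third of master's time across the suite.
geomean :: [Double] -> Double
geomean rs = exp (sum (map log rs) / fromIntegral (length rs))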

How to review

I'd like to encourage as many reviewers as possible, and the branch is structured to facilitate this. Each commit builds and passes tests, so they can be reviewed individually. The actual switch from UTF-16 to UTF-8 happens in 1369cd3, and the remaining TODOs are resolved in 99f3c48. Everything else is performance improvements. There are two commits of intimidating size: 2cb3b30, which dumps autogenerated case mappings, and c3ccdb3, which bundles amalgamated simdutf. I tried to keep the other commits to a reasonable size and scope; hopefully they are palatable.

I'm happy to answer questions and add comments to any code that is not clear enough. Do not hesitate to ask.

Known issues

This branch uses the simdutf library, which is massive and written in C++. Since this is a bit inconvenient for a boot package, @kozross is currently finalizing a pure C replacement, which will be submitted for review in a separate PR.

@Fuuzetsu
Member

Fuuzetsu commented Aug 22, 2021

I guess this also addresses #272 in one swoop

@chrisdone
Member

What’s slower?

@Fuuzetsu
Member

What’s slower?

I guess it'd be good to see what encodeUtf16 and friends give now. I suspect these are barely used compared to encodeUtf8 anyway.

@jberryman

jberryman commented Aug 23, 2021

EDIT: Oops, I totally botched processing the CSV, as Bodigrim points out below

I'll see if I can build https://github.com/hasura/graphql-engine with this branch and run it through our benchmarking infra this week.

EDIT: going back to the UTF-8 proposal, it looks like, for better or worse, the only concrete acceptance criteria with respect to performance are:

  • decodeUtf8 and encodeUtf8 become at least 2x faster.
  • Geometric mean of existing benchmarks (which favor UTF-16) decreases.
  • Fusion (as per our test suite) does not regress beyond at most several cases.

so the above is more than acceptable according to that spec

@jkachmar

jkachmar commented Aug 23, 2021

EDIT: Comment is no longer relevant to the discussion; see Bodigrim's response below.

@chrisdone
Member

Thanks for the numbers @jberryman

@Bodigrim
Contributor Author

@jberryman I'm afraid your table does not make sense:

Name master branch utf8 branch 0 Log Ratio
All.Programs.Throughput.LazyTextByteString 27378356325 4471250933 Ratio -0.787
All.Programs.Throughput.TextByteString 23929606012 4610553675 4.361 -0.715

The first two rows are corrupted, so let's start from the third one. Here Ratio does not match the time measurements: 4610553675 / 23929606012 = 0.193, which means that the utf8 branch is 5x faster, not 4.361x slower. Everything else misrepresents the original data in a similar way.

Instead of sorting rows by Ratio, you sorted the values in the Ratio column on their own and reversed the table, apparently obtaining nonsensical results.

@jberryman

jberryman commented Aug 23, 2021

Doh! Sorry

Regressions:
Name master branch utf8 branch Ratio Log Ratio
All.Pure.japanese.filter.filter.Text 4286460 18693165 4.361 0.640
All.Pure.russian.filter.filter.Text 6391499 26611977 4.164 0.619
All.Pure.russian.length.filter.Text 5990205 24360620 4.067 0.609
All.Pure.ascii.mapAccumL.Text 908974332800 3638772627300 4.003 0.602
All.Pure.japanese.length.filter.Text 3907379 14647059 3.749 0.574
All.Pure.japanese.length.words.Text 38836980 143547741 3.696 0.568
All.Pure.japanese.words.Text 39674886 140793437 3.549 0.550
All.Pure.russian.length.filter.filter.Text 6046403 21338005 3.529 0.548
All.Pure.russian.map.map.Text 25095436 85788868 3.419 0.534
All.Pure.english.mapAccumL.Text 60778409175 207472001800 3.414 0.533
All.Pure.ascii-small.intersperse.Text 477122164 1584540007 3.321 0.521
All.Pure.japanese.intersperse.Text 35325822 112014173 3.171 0.501
All.Pure.japanese.length.filter.filter.Text 3852873 11845036 3.074 0.488
All.Pure.ascii-small.mapAccumL.Text 524822893 1584027082 3.018 0.480
All.Pure.russian.intersperse.Text 51438771 154943289 3.012 0.479
All.Pure.japanese.map.map.Text 17463211 52084321 2.983 0.475
All.Pure.ascii.map.map.Text 234876768275 689893193000 2.937 0.468
All.Pure.english.map.map.Text 16816703650 48141679000 2.863 0.457
All.Pure.ascii-small.map.map.Text 268999974 769964654 2.862 0.457
All.Pure.english.mapAccumL.LazyText 47612369112 135492460800 2.846 0.454
All.Pure.ascii-small.mapAccumL.LazyText 492221926 1368635479 2.781 0.444
All.Pure.ascii.mapAccumR.Text 1360928562550 3743383120800 2.751 0.439
All.Pure.english.mapAccumR.Text 79957684125 209392756075 2.619 0.418
All.Pure.ascii-small.length.filter.Text 60110941 155777043 2.591 0.414
All.Pure.english.length.filter.Text 3518092281 9072480412 2.579 0.411
All.Pure.ascii.length.filter.Text 53715988100 134837148600 2.510 0.400
All.Pure.ascii-small.mapAccumR.Text 689645017 1708705487 2.478 0.394
All.Pure.ascii.filter.filter.Text 55365971900 135428844350 2.446 0.388
All.Pure.ascii-small.filter.filter.Text 64062812 156431460 2.442 0.388
All.Pure.ascii.concat.LazyText 25956179262 61828057150 2.382 0.377
All.Pure.ascii.mapAccumL.LazyText 892289471600 2104454784000 2.358 0.373
All.Pure.russian.mapAccumL.LazyText 40192719 94138204 2.342 0.370
All.Pure.english.filter.filter.Text 3861193850 8971767900 2.324 0.366
All.Pure.english.mapAccumR.LazyText 64822552500 140001104800 2.160 0.334
All.Pure.tiny.length.cons.Text 14363 30306 2.110 0.324
All.Pure.tiny.length.take.LazyText 29270 61375 2.097 0.322
All.Pure.ascii-small.mapAccumR.LazyText 677831689 1391476798 2.053 0.312
All.Pure.tiny.length.take.Text 17334 34548 1.993 0.300
All.Pure.tiny.length.drop.LazyText 31327 61538 1.964 0.293
All.FileIndices.Text 1722222759 3368145693 1.956 0.291
All.Pure.russian.mapAccumL.Text 44009081 85401880 1.941 0.288
All.Pure.russian.length.words.Text 59840591 113316215 1.894 0.277
All.FileRead.LazyText 18601989650 34579194700 1.859 0.269
All.Pure.ascii-small.length.filter.filter.Text 60108618 110940976 1.846 0.266
All.Pure.tiny.length.tail.Text 15947 29348 1.840 0.265
All.Pure.tiny.map.map.Text 40689 73561 1.808 0.257
All.Pure.ascii.length.filter.filter.Text 53382547300 96310759700 1.804 0.256
All.Pure.english.length.filter.filter.Text 3554604815 6396166743 1.799 0.255
All.Pure.tiny.length.replicate string.Text 17935 32192 1.795 0.254
All.Pure.tiny.length.drop.Text 22087 39054 1.768 0.248
All.Pure.tiny.take.LazyText 28916 50071 1.732 0.238
All.Pure.english.intersperse.Text 64852656850 110296260500 1.701 0.231
All.Pure.tiny.length.map.Text 17986 30326 1.686 0.227
All.Pure.russian.words.Text 63318701 106785640 1.686 0.227
All.Pure.english.words.Text 94305583550 156552526850 1.660 0.220
All.Pure.ascii.intersperse.Text 1097287378800 1807749568100 1.647 0.217
All.Pure.tiny.length.init.Text 16455 27004 1.641 0.215
All.Pure.ascii-small.words.Text 435580662 706163431 1.621 0.210
All.Programs.Throughput.LazyText 33153355300 53489191400 1.613 0.208
All.Pure.tiny.words.Text 35595 56988 1.601 0.204
All.Pure.tiny.length.words.Text 37474 59682 1.593 0.202
All.Pure.ascii.mapAccumR.LazyText 1337590373100 2128827338800 1.592 0.202
All.Programs.Throughput.Text 36894744925 58716190900 1.591 0.202
All.Pure.tiny.length.replicate char.Text 20409 31951 1.566 0.195
All.Pure.english.length.words.Text 24862166800 38848806662 1.563 0.194
All.Pure.japanese.mapAccumL.LazyText 31111157 48004810 1.543 0.188
All.Pure.russian.mapAccumR.Text 61982220 95396325 1.539 0.187
All.Pure.ascii.length.words.Text 380019569800 581904752000 1.531 0.185
All.Pure.russian.mapAccumR.LazyText 60163025 90158680 1.499 0.176
All.Pure.ascii-small.length.words.Text 431157632 641999207 1.489 0.173
All.FileRead.Text 22193181037 32356484450 1.458 0.164
All.Pure.tiny.length.init.LazyText 25821 37288 1.444 0.160
All.Pure.tiny.drop.LazyText 34048 48234 1.417 0.151
All.Programs.Fold 130703628900 185012830000 1.416 0.151
All.Pure.tiny.take.Text 18041 25449 1.411 0.149
All.Pure.tiny.length.tail.LazyText 22157 30972 1.398 0.145
All.ReadLines.Text 26481958375 36025864550 1.360 0.134
All.Pure.tiny.intersperse.Text 121332 163914 1.351 0.131
All.Pure.japanese.isInfixOf.LazyText 9485642 12734068 1.342 0.128
All.Pure.tiny.drop.Text 18699 24777 1.325 0.122
All.Pure.japanese.append.Text 3813036 5048982 1.324 0.122
All.FileIndices.LazyText 6163252356 8156591087 1.323 0.122
All.Pure.ascii-small.Builder.mappend char 40402773 53383941 1.321 0.121
All.Pure.japanese.Builder.mappend char 40523358 53078920 1.310 0.117
All.Pure.tiny.length.map.map.Text 23670 30874 1.304 0.115
All.Pure.tiny.length.filter.Text 17223 22367 1.299 0.113
All.Pure.russian.Builder.mappend char 40321784 52296500 1.297 0.113
All.Replace.LazyText 7591184037 9838982087 1.296 0.113
All.Pure.english.Builder.mappend char 40925067 52787836 1.290 0.111
All.Pure.russian.map.Text 70205178 90176283 1.284 0.109
All.Pure.tiny.Builder.mappend char 41598222 52693260 1.267 0.103
All.Pure.japanese.map.Text 48719205 61551012 1.263 0.102
All.Pure.ascii.Builder.mappend char 42090766 52697784 1.252 0.098
All.Pure.russian.foldl'.Text 46374687 57498525 1.240 0.093
All.Pure.english.tail.LazyText 313540 387425 1.236 0.092
All.Pure.ascii.words.LazyText 3200002919100 3951757023600 1.235 0.092
All.Pure.russian.reverse.Text 13685795 16806661 1.228 0.089
All.DecodeUtf8.ascii.strict decodeASCII 16597790097 20361800856 1.227 0.089
All.Pure.ascii.words.Text 1923375551450 2335346355100 1.214 0.084
All.DecodeUtf8.ascii.strict decodeLatin1 16669017806 20201149628 1.212 0.083
All.Pure.tiny.length.filter.filter.Text 17128 20717 1.210 0.083
All.Pure.russian.length.filter.filter.LazyText 111583214 134526054 1.206 0.081
All.DecodeUtf8.ascii.Strict 16353951638 19685246662 1.204 0.081
All.Stream.stream.Text 33265855475 39864031600 1.198 0.079
All.Pure.russian.isInfixOf.LazyText 8267631 9853384 1.192 0.076
All.Pure.russian.length.filter.LazyText 115133204 136725833 1.188 0.075
All.DecodeUtf8.ascii.strict decodeUtf8 16567841664 19555564587 1.180 0.072
All.Programs.Cut.Text 38600987350 45420349225 1.177 0.071
All.Pure.japanese.mapAccumL.Text 33773446 39456762 1.168 0.068
All.Pure.russian.zipWith.Text 155804762 180284474 1.157 0.063
All.Pure.japanese.mapAccumR.Text 44827642 51784380 1.155 0.063
All.Pure.japanese.foldl'.Text 32705090 37488457 1.146 0.059
All.Pure.russian.intersperse.LazyText 283186349 323150051 1.141 0.057
All.Pure.japanese.zipWith.Text 108932297 124234861 1.140 0.057
All.Pure.ascii-small.map.Text 756522332 859308795 1.136 0.055
All.Pure.tiny.map.LazyText 203443 229399 1.128 0.052
All.ReadNumbers.DecimalText 453009064 509126414 1.124 0.051
All.Pure.english.Builder.mappend 8 char 70987 79812 1.124 0.051
All.Pure.japanese.Builder.mappend 8 char 74208 83193 1.121 0.050
All.Pure.russian.length.words.LazyText 87385175 97790055 1.119 0.049
All.Programs.BigTable 175518625700 196159030400 1.118 0.048
All.Pure.russian.filter.LazyText 128635242 143446543 1.115 0.047
All.Pure.tiny.Builder.mappend 8 char 71272 79092 1.110 0.045
All.Pure.ascii.decode.Text 17054489931 18899973825 1.108 0.045
All.Builder.Int.Decimal.Show.12 185275 204834 1.106 0.044
All.Pure.ascii.length.intercalate.LazyText 70177439375 77541390000 1.105 0.043
All.Pure.japanese.concat.Text 9245860 10207330 1.104 0.043
All.Pure.tiny.length.cons.LazyText 28795 31707 1.101 0.042
All.Stream.stream.LazyText 47489877000 52205395000 1.099 0.041
All.Pure.ascii.decode'.Text 17238519398 18947115375 1.099 0.041
All.Pure.japanese.words.LazyText 54124018 59078675 1.092 0.038
All.Pure.russian.words.LazyText 102432943 110916035 1.083 0.035
All.Pure.russian.foldl'.LazyText 127752375 138417774 1.083 0.035
All.Pure.russian.Builder.mappend 8 char 73192 79158 1.082 0.034
All.Pure.tiny.uncons.LazyText 46043 49599 1.077 0.032
All.Pure.tiny.filter.LazyText 105763 113667 1.075 0.031
All.Pure.ascii-small.map.map.LazyText 1907625318 2048294912 1.074 0.031
All.Pure.japanese.length.words.LazyText 52596298 56397154 1.072 0.030
All.ReadNumbers.DoubleText 2877579064 3073777312 1.068 0.029
All.Pure.tiny.foldl'.Text 45520 48612 1.068 0.029
All.Pure.ascii-small.zipWith.Text 1690593403 1804527026 1.067 0.028
All.Pure.japanese.Builder.mappend text 417791315 445266578 1.066 0.028
All.Pure.russian.filter.filter.LazyText 116662979 124236636 1.065 0.027
All.Pure.japanese.map.LazyText 132459812 140812427 1.063 0.027
All.Pure.japanese.filter.LazyText 76294521 81085629 1.063 0.026
All.Pure.ascii-small.Builder.mappend 8 char 73099 77613 1.062 0.026
All.Pure.russian.map.LazyText 196094014 208047672 1.061 0.026
All.Pure.russian.map.map.LazyText 181665864 192183338 1.058 0.024
All.Pure.japanese.intersperse.LazyText 200228871 211215202 1.055 0.023
All.Pure.japanese.length.filter.LazyText 71222038 75014667 1.053 0.023

Removing the "japanese" and "russian" benchmarks doesn't change the picture significantly. Like @chrisdone I'd be interested in an assessment of the regressions from @Bodigrim or someone who knows the benchmarks well.

@Bodigrim
Contributor Author

As I wrote in the opening post, I'm working on a detailed analysis of the performance.
@jberryman could you please wrap the table in a spoiler?

jberryman added a commit to jberryman/deferred-folds that referenced this pull request Aug 23, 2021
jberryman added a commit to jberryman/attoparsec that referenced this pull request Aug 23, 2021
@Bodigrim
Contributor Author

Performance report

This report compares the performance of the text package with a UTF-8-encoded internal representation (the utf8 branch) against the UTF-16 representation (the master branch). Readers are encouraged to read the original proposal for the UTF-8 transition first, especially the "Performance impact" section, which provides the necessary background.

Tom Harper's original thesis, "Fusion on Haskell Unicode Strings", discusses performance differences between the UTF-8, UTF-16 and UTF-32 encodings. Basically, UTF-32 offers the best performance for string operations in synthetic benchmarks, simply because it is a fixed-length encoding, so parsing a UTF-32-encoded buffer is a no-op. UTF-16 is somewhat worse, because characters can take 16 or 32 bits. However, parsing and printing code points is still very simple: there are only two branches, and since the vast majority of code points are 16 bits long, CPU branch prediction works wonderfully.
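
To illustrate (a sketch, not code from this branch): a single UTF-16 decoding step needs only one branch, distinguishing a surrogate pair from a plain code unit.

import Data.Bits (shiftL, (.&.))
import Data.Char (chr)
import Data.Word (Word16)

-- One step of UTF-16 decoding: returns the decoded Char and the number
-- of 16-bit code units consumed. Error handling is omitted.
utf16Step :: Word16 -> Word16 -> (Char, Int)
utf16Step u v
  | u >= 0xD800 && u <= 0xDBFF =            -- high surrogate: a pair
      ( chr (0x10000
             + (fromIntegral (u .&. 0x3FF) `shiftL` 10)
             + fromIntegral (v .&. 0x3FF))
      , 2 )
  | otherwise = (chr (fromIntegral u), 1)   -- everything else: one unit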

UTF-8, however, poses certain performance challenges. Code points can be represented by 1, 2, 3 or 4 bytes, parsing and printing them involves multiple bitwise operations, and CPU branch prediction becomes ineffective, because non-ASCII texts constantly switch between branches. Memory savings from UTF-8 hardly affect synthetic benchmarks, because the data usually fits into the CPU cache. Synthetic benchmarks also do not account for encoding/decoding data from external sources (usually in UTF-8); they measure pure processing time. That's why text originally went for the UTF-16 encoding.

The key to better UTF-8 performance is to avoid parsing and printing of code points as much as possible. For example, the most common operations on text are cutting and concatenation, and neither of them requires actual parsing of individual characters: they can be executed over opaque memory buffers. Given that external sources are most likely UTF-8 encoded, an application can spend its entire lifetime without ever interpreting a text character by character, thus achieving very decent performance. A sketch of such a lifecycle follows below.
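
This is illustrative code against the public text / bytestring APIs, not code from the branch:

import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- A pipeline that never interprets characters one by one: decodeUtf8
-- validates and copies, splitting and concatenation shuffle buffers,
-- and encodeUtf8 (with UTF-8 inside) is essentially a copy.
process :: BS.ByteString -> BS.ByteString
process input =
  let t           = TE.decodeUtf8 input    -- validate once, then copy
      (hdr, rest) = T.splitAt 16 t         -- cut without decoding code points
      out         = T.concat [hdr, T.pack ": ", rest]  -- glue opaque buffers
  in  TE.encodeUtf8 out                    -- back to bytes, mostly a memcpy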

Operations that necessitate character-by-character interpretation are likely to regress with UTF-8. E.g., map succ, measured in isolation, could very well be faster on a UTF-16 buffer. We argue that this is an acceptable tradeoff, for two reasons. Firstly, if we measure a full pipeline, including decoding an input file and encoding an output, the savings from these no-ops are likely to outweigh map succ, unless there is a very long chain of maps. Secondly, Unicode is so complex and multifaceted that in practical applications the parsing and printing of characters is trumped by the application of f in map f. For instance, toUpper / toLower are actually faster in our utf8 branch than they were in master.


The original proposal stated three performance goals:

  • Fusion (as per our test suite) does not regress beyond at most several cases.
  • decodeUtf8 and encodeUtf8 become at least 2x faster.
  • Geometric mean of existing benchmarks (which favor UTF-16) decreases.

While the work on the UTF-8 transition was underway, the text package decided to abandon the implicit fusion framework, as it was demonstrated to harm asymptotic (!) performance (#348). FWIW, early drafts of the utf8 branch, prior to the 21st of June, showed no issues with fusion, e.g., https://github.com/Bodigrim/text/commits/utf8-210609.

Next, here are results for encodeUtf8:

git checkout master
cabal run text-benchmarks -- -t100 -p encode --csv text-master-encode.csv
git checkout utf8
cabal run text-benchmarks -- -t100 -p encode --baseline text-master-encode.csv
tiny
  encode
    Text:     18.3 ns ± 202 ps, 58% faster than baseline
    LazyText: 33.1 ns ± 364 ps, 61% faster than baseline
ascii-small
  encode
    Text:     4.80 μs ± 58 ns, 55% faster than baseline
    LazyText: 7.57 μs ± 71 ns, 86% faster than baseline
ascii
  encode
    Text:     7.17 ms ± 40 μs, 60% faster than baseline
    LazyText: 78.1 ms ± 672 μs, 32% faster than baseline
english
  encode
    Text:     347 μs ± 5.5 μs, 44% faster than baseline
    LazyText: 510 μs ± 4.4 μs, 81% faster than baseline
russian
  encode
    Text:     1.36 μs ± 24 ns, 88% faster than baseline
    LazyText: 1.37 μs ± 13 ns, 92% faster than baseline
japanese
  encode
    Text:     1.26 μs ± 19 ns, 88% faster than baseline
    LazyText: 1.28 μs ± 23 ns, 91% faster than baseline

As expected, we get astonishing speed-ups for non-ASCII data. Results for English texts are a bit less impressive, but bear in mind that master recently gained (#302) a twice-as-fast SIMD-based encoder, which is very good for pure, uninterrupted ASCII. And for the ascii benchmark, where the input is 50M long, memory bandwidth becomes the bottleneck, throttling further speed-up opportunities.

Moving to decodeUtf8, it's worth noting that this is not a no-op. While for a valid UTF-8 ByteString it suffices to copy it into a Text, one must check that the input is valid first. Naively, validating a UTF-8 encoding is no faster than parsing characters one by one with appropriate error reporting. However, we employ the simdutf library, which validates Unicode using a vectorised state machine.
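
For contrast, here is what a naive scalar validator looks like: one branchy step per byte group. This is only a sketch to show the shape of the problem; the branch itself calls simdutf through FFI.

import qualified Data.ByteString as B
import Data.Word (Word8)

-- A naive UTF-8 validator, rejecting overlong forms and surrogates.
-- This is the kind of loop that simdutf replaces with SIMD.
isValidUtf8Naive :: B.ByteString -> Bool
isValidUtf8Naive = go . B.unpack
  where
    cont b = 0x80 <= b && b <= 0xBF

    go :: [Word8] -> Bool
    go [] = True
    go (b : rest)
      | b <= 0x7F = go rest                                      -- ASCII
    go (b : c1 : rest)
      | 0xC2 <= b, b <= 0xDF, cont c1 = go rest                  -- 2 bytes
    go (b : c1 : c2 : rest)                                      -- 3 bytes
      | b == 0xE0, 0xA0 <= c1, c1 <= 0xBF, cont c2 = go rest     -- no overlongs
      | 0xE1 <= b, b <= 0xEC, cont c1, cont c2 = go rest
      | b == 0xED, 0x80 <= c1, c1 <= 0x9F, cont c2 = go rest     -- no surrogates
      | 0xEE <= b, b <= 0xEF, cont c1, cont c2 = go rest
    go (b : c1 : c2 : c3 : rest)                                 -- 4 bytes
      | b == 0xF0, 0x90 <= c1, c1 <= 0xBF, cont c2, cont c3 = go rest
      | 0xF1 <= b, b <= 0xF3, cont c1, cont c2, cont c3 = go rest
      | b == 0xF4, 0x80 <= c1, c1 <= 0x8F, cont c2, cont c3 = go rest
    go _ = False                                                 -- malformed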

git checkout master
cabal run text-benchmarks -- -t100 -p '$2=="Pure" && $4=="decode"' --csv text-master-decode.csv
git checkout utf8
cabal run text-benchmarks -- -t100 -p '$2=="Pure" && $4=="decode"' --baseline text-master-decode.csv
tiny
  decode
    Text:     34.0 ns ± 286 ps, 36% faster than baseline
    LazyText: 150 ns ± 106 ps, 42% faster than baseline
ascii-small
  decode
    Text:     5.94 μs ± 92 ns, 58% faster than baseline
    LazyText: 8.68 μs ± 148 ns, 52% faster than baseline
ascii
  decode
    Text:     10.3 ms ± 163 μs, 54% faster than baseline
    LazyText: 74.5 ms ± 406 μs, 59% faster than baseline
english
  decode
    Text:     398 μs ± 3.6 μs, 63% faster than baseline
    LazyText: 575 μs ± 7.3 μs, 46% faster than baseline
russian
  decode
    Text:     1.88 μs ± 27 ns, 95% faster than baseline
    LazyText: OK (5.02s) 2.41 μs ± 30 ns, 93% faster than baseline
japanese
  decode
    Text:     2.15 μs ± 5.2 ns, 93% faster than baseline
    LazyText: OK (22.11s) 2.63 μs ± 13 ns, 91% faster than baseline

Again, results for non-ASCII inputs are pretty much fantastic, while English texts are less impressive (but still 2x faster), for similar reasons as above. The tiny benchmark is really tiny: just five letters, so there is not enough runway for the simdutf vectorised state machine to accelerate, and we get a comparatively modest speed-up.

With regards to the third stated goal: as mentioned earlier, the geometric mean over all benchmarks is 0.33. Obviously, this single number does not characterise them in full, and indeed it's a mixed bag. I'll cover the most notable regressions (and the most notable improvements!) later this week.

@chrisdone
Member

Great report, thanks @Bodigrim. I am highly enthusiastic about this change. 👏

@chrisdone
Member

chrisdone left a comment

Did a cursory look; limited time right now.

jberryman added a commit to jberryman/attoparsec that referenced this pull request Aug 26, 2021
copyI semantics changed; renaming for 16 -> 8; use iter* functions from Text directly
@tomjaguarpaw
Member

I'm really pleased to see this! I have had a look at each of the commits to get an overall impression of the PR. I don't feel that I have the expertise in this area to give a proper review, though.

@Bodigrim
Contributor Author

Performance report: regressions

(Bear in mind that I have recently fixed more performance issues, so earlier reports are no longer fully relevant.)

I was composing the report below piecewise, so the numbers do not all belong to the same commit (because this is a moving target and benchmarks take hours) or even to the same machine. However, I believe it reliably characterises the key classes of performance issues. I'm happy to delve deeper and answer questions about specific instances or use cases, if desirable.

Data.Text.readFile vs. T.decodeUtf8 . ByteString.readFile

All.FileRead.Text,21440148375,32011948000,1.49308
All.FileRead.LazyText,19729004900,30573543125,1.54967
All.FileRead.TextByteString,23057155743,3958878015,0.171698
All.FileRead.LazyTextByteString,20230595575,4249293104,0.210043

These results correspond to

[ bench "Text" $ whnfIO $ T.length <$> T.readFile p
, bench "LazyText" $ whnfIO $ LT.length <$> LT.readFile p
, bench "TextByteString" $ whnfIO $ (T.length . T.decodeUtf8) <$> SB.readFile p
, bench "LazyTextByteString" $ whnfIO $ (LT.length . LT.decodeUtf8) <$> LB.readFile p
]

The first two benchmarks measure locale-dependent file reading of a Russian text. The nature of locale-dependent reading is that GHC.IO.Buffer first decodes an input file from the system locale to a UTF-32-encoded buffer, and then Data.Text.IO converts the UTF-32 buffer to Text. As discussed above, it's expected that decoding UTF-32 to UTF-8 is slower (up to 55%) than decoding to UTF-16 (which mostly boils down to Word32 truncation). There is nothing to win here, as long as we are limited by GHC.IO.Buffer.

However, it was argued long ago that users should beware of T.readFile and use T.decodeUtf8 . ByteString.readFile instead. And indeed, as we can see from the two latter benchmarks, T.decodeUtf8 . ByteString.readFile is 5x faster now, which in my opinion completely redeems the slowdown in T.readFile. The pattern is sketched below.
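
A sketch of both patterns against the public API (illustrative only):

import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.IO as TIO

-- Locale-dependent: funnelled through GHC.IO.Buffer's UTF-32 representation.
readLocale :: FilePath -> IO T.Text
readLocale = TIO.readFile

-- Recommended for files known to be UTF-8: read raw bytes, then validate and copy.
readUtf8 :: FilePath -> IO T.Text
readUtf8 path = TE.decodeUtf8 <$> BS.readFile path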

All.ReadLines.Text,26991334300,38123337400,1.41243
All.Programs.Fold,154487472000,197604059800,1.27909

Same here: this is a locale-dependent file reading of a Russian text.

All.Programs.Throughput.Text,36687180825,60851039100,1.65865
All.Programs.Throughput.LazyText,32469217750,55323163500,1.70386
All.Programs.Throughput.TextByteString,29604138418,4192649650,0.141624
All.Programs.Throughput.LazyTextByteString,27621090150,4613553668,0.16703

And here yet again: while locale-dependent reading regresses, decoding known UTF-8 inputs is 6-7x faster.

Tiny benchmarks

All.Pure.tiny.drop.Text,15132,25045,1.6551
All.Pure.tiny.take.Text,16078,23947,1.48943
All.Pure.tiny.length.cons.Text,12695,23493,1.85057
All.Pure.tiny.length.drop.Text,19860,36794,1.85267
All.Pure.tiny.length.drop.LazyText,28536,40957,1.43527
All.Pure.tiny.length.init.Text,16658,24041,1.44321
All.Pure.tiny.length.init.LazyText,23645,34648,1.46534
All.Pure.tiny.length.map.Text,17709,24007,1.35564
All.Pure.tiny.length.replicate char.Text,18784,29627,1.57725
All.Pure.tiny.length.replicate string.Text,18014,31220,1.7331
All.Pure.tiny.length.take.Text,16378,31957,1.95122
All.Pure.tiny.length.take.LazyText,27934,40618,1.45407
All.Pure.tiny.length.tail.Text,15305,25656,1.67631
All.Pure.tiny.length.tail.LazyText,18594,28876,1.55297

The tiny benchmarks are really tiny: the text is only 5 characters long, and all operations are in the nano range, so they are unlikely to be a bottleneck, as long as nothing is seriously slower. The explanation is that drop / take / length now use 512-bit vectorised implementations, which need a certain runway to accelerate.
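
For intuition, counting code points in UTF-8 amounts to counting bytes which are not continuation bytes; that is the loop which vectorises well but gains nothing on a 5-character input. A scalar sketch of the idea (not the branch's SIMD code):

import Data.Bits ((.&.))
import qualified Data.ByteString as B

-- length over UTF-8 = number of bytes not of the form 0b10xxxxxx.
lengthUtf8 :: B.ByteString -> Int
lengthUtf8 = B.foldl' (\n b -> if b .&. 0xC0 == 0x80 then n else n + 1) 0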

Plenty of other tiny benchmarks are faster, so the geometric mean of this group is 0.82, still well below one.

Filtering

All.Pure.ascii-small.filter.filter.Text,71800920,160835488,2.24002
All.Pure.ascii-small.length.filter.Text,68873323,138127519,2.00553
All.Pure.ascii-small.length.filter.filter.Text,68676682,135933025,1.97932
All.Pure.ascii.filter.filter.Text,62801740200,139964924300,2.22868
All.Pure.ascii.length.filter.Text,60781141300,121388757200,1.99715
All.Pure.ascii.length.filter.filter.Text,60407876225,120215499200,1.99006
All.Pure.english.filter.filter.Text,4339939587,9306667587,2.14442
All.Pure.english.length.filter.Text,4016696125,7925535487,1.97315
All.Pure.english.length.filter.filter.Text,3977912193,8051621793,2.02408
All.Pure.russian.filter.filter.Text,7496362,28343670,3.78099
All.Pure.russian.length.filter.Text,7225880,27338591,3.78343
All.Pure.russian.length.filter.filter.Text,7363396,25771804,3.49999
All.Pure.japanese.filter.filter.Text,5054263,19458210,3.84986
All.Pure.japanese.length.filter.Text,4508120,15550037,3.44934
All.Pure.japanese.length.filter.filter.Text,4561359,14601019,3.20102

The thing is that the benchmarks for filter are quite unrepresentative:

, bgroup "filter"
  [ benchT $ nf (T.length . T.filter p0) ta
  , benchTL $ nf (TL.length . TL.filter p0) tla
  ]

c = 'й'
p0 = (== c)

As one might expect, T.filter (== 'й') returns an empty Text for anything that is not Russian, and even for Russian the output is a tiny fraction of the input (one should rather use T.replicate and T.count). Essentially, these benchmarks measure parsing a buffer into a stream of Char (and discarding the stream outright). As discussed previously, such parsing is expected to be faster for UTF-16.

I would also like to mention that filtering Unicode produces meaningful results only in a handful of scenarios. E.g., T.filter isAscii produces different results depending on the Unicode normalization of observably indistinguishable inputs. It's unlikely that T.filter becomes a bottleneck in a well-designed system.
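
A concrete illustration of the normalization pitfall:

import Data.Char (isAscii)
import qualified Data.Text as T

-- Two renderings of "é" which display identically:
precomposed, decomposed :: T.Text
precomposed = T.pack "\x00E9"   -- U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed  = T.pack "e\x0301"  -- 'e' followed by U+0301 COMBINING ACUTE ACCENT

-- T.filter isAscii precomposed == ""   (the single code point is non-ASCII)
-- T.filter isAscii decomposed  == "e"  (the base letter is ASCII, the accent is not)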

Mapping

All.Pure.ascii-small.mapAccumL.Text,608466155,1565305392,2.57254
All.Pure.ascii-small.mapAccumL.LazyText,544866703,1232343353,2.26173
All.Pure.ascii-small.mapAccumR.Text,728927382,1694930725,2.32524
All.Pure.ascii-small.mapAccumR.LazyText,716806537,1355231652,1.89065
All.Pure.ascii.mapAccumL.Text,986479975600,2729346036400,2.76675
All.Pure.ascii.mapAccumL.LazyText,1043058781400,2070791592400,1.98531
All.Pure.ascii.mapAccumR.Text,1422074596600,2720855812100,1.9133
All.Pure.ascii.mapAccumR.LazyText,1490645122500,2148077009200,1.44104
All.Pure.english.mapAccumL.Text,60874261256,176335668000,2.89672
All.Pure.english.mapAccumL.LazyText,52894601675,137879584200,2.60669
All.Pure.english.mapAccumR.Text,71870856700,187798907350,2.613
All.Pure.english.mapAccumR.LazyText,80292101500,144580070800,1.80068
All.Pure.russian.mapAccumL.Text,59171156,93213606,1.57532
All.Pure.russian.mapAccumL.LazyText,39046606,91835323,2.35194
All.Pure.russian.mapAccumR.Text,67980959,94374382,1.38825
All.Pure.russian.mapAccumR.LazyText,66807119,94789778,1.41886
All.Pure.japanese.mapAccumL.LazyText,30840265,46513529,1.50821
All.Pure.japanese.mapAccumR.Text,49972835,55210934,1.10482

mapAccumL / mapAccumR certainly regress badly. Unfortunately, the nature of these functions requires us to parse and print character by character, which is slower for UTF-8 than for UTF-16. Originally I did not anticipate that worse CPU branch prediction could lead to such a drastic performance difference.

I believe that this is acceptable nevertheless, simply because I struggle to find any real-world use cases for mapAccum{L,R}. Again, it's incredibly difficult to imagine a usage of mapAccum{L,R} which carefully handles Unicode intricacies and subtleties, yet is still bound by the performance of UTF-8 parsing.

All.Pure.tiny.map.Text,69715,46306,0.664219
All.Pure.tiny.map.LazyText,172889,57529,0.332751
All.Pure.tiny.map.map.Text,35199,51573,1.46518
All.Pure.tiny.map.map.LazyText,147299,61317,0.416276
All.Pure.ascii-small.map.Text,570986035,313665625,0.54934
All.Pure.ascii-small.map.LazyText,1495289648,307545056,0.205676
All.Pure.ascii-small.map.map.Text,138758526,393272460,2.83422
All.Pure.ascii-small.map.map.LazyText,1399017382,355304809,0.253967
All.Pure.ascii.map.Text,581282600000,265156600000,0.456158
All.Pure.ascii.map.LazyText,1431011100000,306147300000,0.213938
All.Pure.ascii.map.map.Text,117102675000,338362775000,2.88945
All.Pure.ascii.map.map.LazyText,1347195900000,347654612500,0.258058
All.Pure.english.map.Text,40569093750,17671465625,0.435589
All.Pure.english.map.LazyText,92442137500,17780534375,0.192342
All.Pure.english.map.map.Text,8030321875,22338706250,2.78179
All.Pure.english.map.map.LazyText,84480615625,20408321875,0.241574
All.Pure.russian.map.Text,54396923,46203198,0.849372
All.Pure.russian.map.LazyText,139894458,45981140,0.328685
All.Pure.russian.map.map.Text,12951262,54321185,4.19428
All.Pure.russian.map.map.LazyText,125892663,54785674,0.435178
All.Pure.japanese.map.Text,37302227,32993960,0.884504
All.Pure.japanese.map.LazyText,95479821,32010337,0.335258
All.Pure.japanese.map.map.Text,9422100,34822103,3.69579
All.Pure.japanese.map.map.LazyText,90500512,35118249,0.388045

Similar to the above, it's not surprising for map succ to regress. It's actually more surprising that the results are not uniform across the board: 3/4 of the benchmarks became faster! The geometric mean over this group is 0.62. Again, I do not expect a real-life application of map f, which faithfully processes Unicode, to be bottlenecked on UTF-8 parsing, because such an f must be quite involved.

If you are worried about the performance of case conversions, worry no longer. They vote unanimously in favor of UTF-8 (with a geometric mean of 0.42):

All.Pure.tiny.toLower.Text,414602,186216,0.449144
All.Pure.tiny.toLower.LazyText,437520,232168,0.530645
All.Pure.tiny.toUpper.Text,424902,197850,0.465637
All.Pure.tiny.toUpper.LazyText,446135,251065,0.562756
All.Pure.ascii-small.toLower.Text,5427945312,2014412500,0.371119
All.Pure.ascii-small.toLower.LazyText,5603268750,2170191210,0.387308
All.Pure.ascii-small.toUpper.Text,5399546875,2515530078,0.465878
All.Pure.ascii-small.toUpper.LazyText,5807147265,2683082812,0.462031
All.Pure.ascii.toLower.Text,4645536200000,1836974600000,0.395428
All.Pure.ascii.toLower.LazyText,4860871800000,1819362600000,0.374287
All.Pure.ascii.toUpper.Text,4778418000000,2250802600000,0.471035
All.Pure.ascii.toUpper.LazyText,4997500800000,2288388800000,0.457907
All.Pure.english.toLower.Text,316019437500,126184218750,0.399293
All.Pure.english.toLower.LazyText,324979400000,126374043750,0.388868
All.Pure.english.toUpper.Text,317852300000,150322700000,0.472933
All.Pure.english.toUpper.LazyText,333748737500,156130225000,0.467808
All.Pure.russian.toLower.Text,497689624,195327001,0.392467
All.Pure.russian.toLower.LazyText,525514331,209794140,0.399217
All.Pure.russian.toUpper.Text,497672460,232213159,0.466598
All.Pure.russian.toUpper.LazyText,532040087,249128051,0.468251
All.Pure.japanese.toLower.Text,347989550,120762292,0.347029
All.Pure.japanese.toLower.LazyText,368167578,133719238,0.363202
All.Pure.japanese.toUpper.Text,353876123,121747778,0.344041
All.Pure.japanese.toUpper.LazyText,375892382,136650164,0.363535

Appending builders

All.Pure.tiny.Builder.mappend char,22403727,44948850,2.00631
All.Pure.tiny.Builder.mappend 8 char,79327,105339,1.32791
All.Pure.tiny.Builder.mappend text,501421785,478581864,0.95445
All.Pure.ascii-small.Builder.mappend char,23652920,45294095,1.91495
All.Pure.ascii-small.Builder.mappend 8 char,90781,110436,1.21651
All.Pure.ascii-small.Builder.mappend text,570745780,498310443,0.873087
All.Pure.ascii.Builder.mappend char,24550711,46687537,1.90168
All.Pure.ascii.Builder.mappend 8 char,102611,113728,1.10834
All.Pure.ascii.Builder.mappend text,593820546,538251198,0.906421
All.Pure.english.Builder.mappend char,24330778,46123828,1.8957
All.Pure.english.Builder.mappend 8 char,106896,107841,1.00884
All.Pure.english.Builder.mappend text,585586463,521747294,0.890983
All.Pure.russian.Builder.mappend char,23933431,45348496,1.89478
All.Pure.russian.Builder.mappend 8 char,108024,104745,0.969646
All.Pure.russian.Builder.mappend text,601874503,511843154,0.850415
All.Pure.japanese.Builder.mappend char,24568698,45829181,1.86535
All.Pure.japanese.Builder.mappend 8 char,103933,111322,1.07109
All.Pure.japanese.Builder.mappend text,580084975,522780439,0.901214

mappend char is a benchmark that glues together 10000 results of T.singleton. Since UTF-8 encoding of a single Char is more involved, this takes more time than for UTF-16. However, the results for mappend text (which glues together 10000 short texts) demonstrate that this slowdown is limited to rather artificial scenarios.
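
If builder throughput matters, the usual advice holds: glue chunks, not singletons. A sketch with the public Data.Text.Lazy.Builder API:

import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import Data.Text.Lazy.Builder (fromText, singleton, toLazyText)

-- Pays the per-Char UTF-8 encoding cost for every character
-- (the "mappend char" shape):
perChar :: String -> TL.Text
perChar = toLazyText . foldMap singleton

-- Copies whole buffers instead (the "mappend text" shape):
perChunk :: [T.Text] -> TL.Text
perChunk = toLazyText . foldMap fromText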

Consing and unconsing of lazy Text

All.Pure.ascii.cons.LazyText,15004236,16599998,1.10635
All.Pure.ascii.tail.LazyText,14748675,16838993,1.14173
All.Pure.english.tail.LazyText,354389,461140,1.30123
All.Pure.russian.tail.LazyText,21311,19200,0.900943
All.Pure.japanese.tail.LazyText,21753,20404,0.937986
All.Pure.ascii.uncons.LazyText,14884936,16790311,1.12801
All.Pure.ascii-small.uncons.LazyText,31627,33086,1.04613
All.Pure.english.uncons.LazyText,389700,389421,0.999284
All.Pure.russian.uncons.LazyText,30911,28948,0.936495
All.Pure.japanese.uncons.LazyText,33544,31087,0.926753

Benchmarks for lazy cons / uncons / tail are pretty meaningless: while the operations themselves succeed in nanoseconds, most of the time is spent forcing a chain of chunks and on memory churn. Their strict counterparts do not signal any regressions.

Japanese

All.Pure.japanese.append.Text,1557057,2219109,1.42519
All.Pure.japanese.words.LazyText,83176171,94262719,1.13329
All.Pure.japanese.zipWith.Text,82934658,113370141,1.36698

Japanese benchmarks are especially challenging for UTF-8, because the Japanese script takes 50% more space in UTF-8 than in UTF-16, and parsing it is thrice as difficult. So it's not surprising that there are regressions; it is surprising that there are only a few, all below 50%. Actually, the geometric mean of the Japanese group is 0.26.
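
The size overhead can be checked with the public encoders (a tiny illustration):

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- A single kana takes 3 bytes in UTF-8 but 2 bytes in UTF-16LE,
-- hence the 50% overhead for Japanese text.
kanaUtf8, kanaUtf16 :: Int
kanaUtf8  = B.length (TE.encodeUtf8    (T.pack "あ"))  -- == 3
kanaUtf16 = B.length (TE.encodeUtf16LE (T.pack "あ"))  -- == 2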

Summary

It was known from the inception that certain benchmarks were likely to regress. The regressions are mostly limited to malformed benchmarks or scenarios unlikely to emerge in a real-world application, and they are redeemed by gains in other areas.

@emilypi
Member

emilypi commented Aug 27, 2021

Just ran the benchmarks to confirm on my machine; I get the following with GHC 8.10.4:

On master:

All
  Pure
    tiny
      encode
        Text:     OK (0.16s) 33.1 ns ± 2.9 ns
        LazyText: OK (0.14s) 60.3 ns ± 5.9 ns
    ascii-small
      encode
        Text:     OK (0.24s) 13.0 μs ± 847 ns
        LazyText: OK (0.66s) 76.4 μs ± 5.0 μs
    ascii
      encode
        Text:     OK (1.43s) 14.1 ms ± 1.3 ms
        LazyText: OK (1.48s) 87.5 ms ± 2.7 ms
    english
      encode
        Text:     OK (0.48s) 761 μs ± 67 μs
        LazyText: OK (0.56s) 3.92 ms ± 350 μs
    russian
      encode
        Text:     OK (0.56s) 8.15 μs ± 442 ns
        LazyText: OK (0.50s) 15.0 μs ± 370 ns
    japanese
      encode
        Text:     OK (0.39s) 11.6 μs ± 1.1 μs
        LazyText: OK (0.39s) 10.9 μs ± 832 ns

On utf8:

All
  Pure
    tiny
      encode
        Text:     OK (0.23s) 13.0 ns ± 686 ps, 60% faster than baseline
        LazyText: OK (0.43s) 24.5 ns ± 1.9 ns, 59% faster than baseline
    ascii-small
      encode
        Text:     OK (0.69s) 4.86 μs ± 187 ns, 62% faster than baseline
        LazyText: OK (0.77s) 5.27 μs ± 177 ns, 93% faster than baseline
    ascii
      encode
        Text:     OK (1.82s) 4.12 ms ± 97 μs, 70% faster than baseline
        LazyText: OK (0.72s) 36.3 ms ± 1.1 ms, 58% faster than baseline
    english
      encode
        Text:     OK (0.43s) 306 μs ± 31 μs, 59% faster than baseline
        LazyText: OK (0.80s) 342 μs ± 24 μs, 91% faster than baseline
    russian
      encode
        Text:     OK (0.35s) 584 ns ± 57 ns, 92% faster than baseline
        LazyText: OK (0.37s) 612 ns ± 35 ns, 95% faster than baseline
    japanese
      encode
        Text:     OK (0.74s) 634 ns ± 48 ns, 94% faster than baseline
        LazyText: OK (0.75s) 650 ns ± 33 ns, 94% faster than baseline

Some of the benches are in roughly the same ballpark, but others are drastically faster. Notably, some of the lazy text benchmarks on my machine are much faster. This is great! I'll have a substantive review shortly.

@L-as

L-as commented Aug 27, 2021

Perhaps I've missed this, but how is the performance difference on non-x86-64 platforms?

@Bodigrim
Contributor Author

@L-as the underlying simdutf library offers vectorised UTF-8 validation for the arm64 and ppc64 architectures as well.

@Bodigrim
Contributor Author

Speaking of not-so-synthetic benchmarks, here is prettyprinter, measured against text-1.2.5.0:

All
  80 characters, 50% ribbon
    prettyprinter
      layoutPretty:  OK (2.79s) 180 ms ± 14 ms, 22% faster than baseline
      layoutSmart:   OK (2.76s) 180 ms ± 9.9 ms, 22% faster than baseline
      layoutCompact: OK (2.50s) 162 ms ± 7.6 ms
  Infinite/large page width
    prettyprinter
      layoutPretty:  OK (5.82s) 184 ms ± 11 ms, 20% faster than baseline
      layoutSmart:   OK (1.34s) 181 ms ± 14 ms, 24% faster than baseline
      layoutCompact: OK (1.22s) 166 ms ± 16 ms
All
  Many small words
    Unoptimized:     OK (2.85s) 2.72 μs ± 141 ns, 50% faster than baseline
    Shallowly fused: OK (2.23s) 533 ns ± 42 ns, 29% faster than baseline
    Deeply fused:    OK (2.23s) 532 ns ± 26 ns, 29% faster than baseline
  vs. other libs
    renderPretty
      this, unoptimized:     OK (3.01s) 5.68 μs ± 273 ns, 32% faster than baseline
      this, shallowly fused: OK (1.31s) 5.02 μs ± 410 ns, 34% faster than baseline
      this, deeply fused:    OK (2.64s) 5.05 μs ± 238 ns, 35% faster than baseline
    renderSmart
      this, unoptimized:     OK (1.71s) 6.52 μs ± 592 ns, 31% faster than baseline
      this, shallowly fused: OK (5.70s) 5.45 μs ± 215 ns, 35% faster than baseline
      this, deeply fused:    OK (2.90s) 5.63 μs ± 333 ns, 32% faster than baseline
    renderCompact
      this, unoptimized:     OK (1.21s) 4.65 μs ± 448 ns, 42% faster than baseline
      this, shallowly fused: OK (2.18s) 4.22 μs ± 341 ns, 43% faster than baseline
      this, deeply fused:    OK (9.00s) 4.29 μs ± 190 ns, 43% faster than baseline

@Bodigrim
Contributor Author

Bodigrim commented Sep 6, 2021

Is there some documentation somewhere that outlines the plan for getting this merged and transitioning the community?

text is a boot package, bundled with GHC. In the best-case scenario, text-2.0 will ship with GHC 9.4, around summer 2022, so there is plenty of time for the transition. The outline is as follows:

  • Merge utf8 branch into text HEAD.
  • Relax upper bounds for text in parsec and Cabal HEAD.
  • Bump text submodule (and parsec / Cabal as well) in GHC source tree.
  • Work through head.hackage to provide migration patches.
  • Do a proper text-2.0 release on Hackage.
@Bodigrim
Contributor Author

Bodigrim commented Sep 6, 2021

I have addressed all the feedback above and pushed the changes. Unless there are critical bugs, I'd appreciate it if we could wrap up and merge the branch by the end of the week, so that further work is unblocked.

This is a final call for reviews and approvals.

@ghost

ghost left a comment

Just some questions about the doc updates.

@ketzacoatl

text is a boot package, bundled with GHC. In the best case scenario text-2.0 is to be shipped with GHC 9.4, around summer 2022. So there is plenty of time for transition.

@Bodigrim, is there a way to use the new text package in projects for testing, ahead of when it's included in GHC? or does this question not make sense?

@Boarders

Boarders left a comment

Great work @Bodigrim - I read through what I could and it looks good to me.

@ketzacoatl

Are you looking for something like this:

@Bodigrim, more like your confirmation that you'd expect that type of reference to work. And I'll take your answer as a yes. Thanks!

@emilypi
Member

emilypi left a comment

Approved. Amazing job @Bodigrim 🎉

@Bodigrim Bodigrim merged commit 3488190 into haskell:master Sep 8, 2021
@Bodigrim
Contributor Author

Bodigrim commented Sep 8, 2021

I think that after 150 comments and 200 likes we are in a good position to merge. Thanks everyone for the active participation; feel free to provide more feedback here or in separate issues.

@Bodigrim Bodigrim deleted the utf8 branch September 8, 2021 21:03
@ghost

ghost commented Sep 9, 2021

Congratulations @Bodigrim, fantastic work, well done for getting it across the line.

@Bodigrim
Contributor Author

This is old news, but the PR has been released as part of text-2.0. My ZuriHac talk, covering the 10-year-long story of the UTF-8 transition, is available at https://www.youtube.com/watch?v=1qlGe2qnGZQ.

