DEV Community

Vee Satayamas

Thai word tokenizers benchmark: nlpo3 vs newmm

Thanathip Suntorntip Gorlph ported Newmm, Korakot Chaovavanich's Thai word tokenizer written in Python, to Rust; the port is called nlpo3. The nlpo3 website claims that nlpo3 is 2X faster than Newmm. I felt that nlpo3 must be faster than that, because Python's regex engine backtracks, whereas Rust's regex crate runs in linear time: it deliberately does not support look-ahead or look-behind. Moreover, "2X faster" is ambiguous.
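To make the regex point concrete, here is a minimal sketch of how a backtracking engine such as Python's `re` can blow up on input that fails to match. The pattern `(a+)+$` is a textbook example chosen purely for illustration; it is an assumption of this sketch and is not taken from Newmm, so it says nothing about which patterns Newmm actually uses. The helper name `time_failed_match` is also hypothetical.

```python
import re
import time

# Nested quantifiers: a backtracking engine tries many different ways of
# splitting the run of "a" characters before it can report a failure.
# This pattern is illustrative only; it is NOT taken from Newmm.
PATTERN = re.compile(r"(a+)+$")


def time_failed_match(n: int) -> float:
    """Return the seconds spent failing to match n 'a' characters plus 'b'."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = PATTERN.match(text)
    elapsed = time.perf_counter() - start
    assert result is None  # the trailing "b" makes a match impossible
    return elapsed


if __name__ == "__main__":
    # Each step adds only two characters to the input.
    for n in (14, 16, 18, 20):
        print(f"n={n:2d}  {time_failed_match(n):.4f} s")
```

On a backtracking engine the time should grow much faster than the input length does. An engine with a linear-time guarantee, like Rust's regex crate, is designed to avoid this kind of growth, which is the reason for expecting more than a 2X gap.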

So I ran a slightly different experiment on a Mac mini M1. Both nlpo3 and Newmm were run from Zsh instead of a Python notebook. I tested on 1 million lines of a Thai Wikipedia snapshot. The result: Newmm took 3.66X the time that nlpo3 needed to tokenize the same text on the same computer.

Setup

  • Computer: Scaleway's Mac mini M1
  • Rustc: rustc 1.54.0 (a178d0322 2021-07-26)
  • Python: Python 3.8.2
  • OS: Darwin 506124d8-4acf-4595-9d46-8ca4b44b8110 20.6.0 Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:27 PDT 2021; root:xnu-7195.141.2~5/RELEASE_ARM64_T8101 arm64
  • Script:
    #!/bin/bash
    set -x
    INPUT=thwik-head1m.txt
    for i in {1..10}
    do
      { time python3 newmm.py < $INPUT > newmm.out ; } 2>> bench_newmm.txt
      { time nlpo3 segment < $INPUT > cham.out ; } 2>> bench_o3.txt
    done
  • A command line interface for newmm:
    from pythainlp import word_tokenize
    import sys

    for line in sys.stdin:
        print("|".join(word_tokenize(line[:-1])))

Result

nlpo3

    [root@exper1 ~]# % grep real bench_o3.txt
    real 2m10.923s
    real 2m12.014s
    real 2m10.931s
    real 2m9.448s
    real 2m9.055s
    real 2m10.570s
    real 2m10.672s
    real 2m10.140s
    real 2m11.220s
    real 2m9.941s

newmm

    % grep real bench_newmm.txt
    real 7m52.180s
    real 7m58.090s
    real 7m57.071s
    real 8m9.779s
    real 7m54.576s
    real 7m52.807s
    real 7m59.109s
    real 7m58.489s
    real 7m59.604s
    real 7m57.844s

Average

  • nlpo3
    % grep real bench_o3.txt | ruby -lane 'BEGIN { all = 0.0; cnt = 0 }; cols = $F[1].split(/[ms]/).map {|x| x.to_f }; v = cols[0]*60 + cols[1]; all += v; cnt += 1; END { p all/cnt}'
    130.49140000000003
  • newmm
    % grep real bench_newmm.txt | ruby -lane 'BEGIN { all = 0.0; cnt = 0 }; cols = $F[1].split(/[ms]/).map {|x| x.to_f }; v = cols[0]*60 + cols[1]; all += v; cnt += 1; END { p all/cnt}'
    477.9549

Performance ratio

477.9549 s / 130.4914 s ≈ 3.66, so Newmm took about 3.66 times as long as nlpo3 on average.
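For readers who prefer Python to the Ruby one-liners, the averages and the ratio can be recomputed with a short script. This is only a sketch: the timings are copied from the `grep real` output in this post, while the names `parse_real_seconds`, `average_real` and `slowdown` are hypothetical helpers introduced for illustration.

```python
import re
from statistics import mean

# Matches the "real XmY.YYYs" lines that the shell's `time` keyword prints.
REAL_LINE = re.compile(r"real\s+(\d+)m([\d.]+)s")

# Timings copied from bench_o3.txt and bench_newmm.txt as shown in this post.
NLPO3_REAL = [
    "real 2m10.923s", "real 2m12.014s", "real 2m10.931s", "real 2m9.448s",
    "real 2m9.055s", "real 2m10.570s", "real 2m10.672s", "real 2m10.140s",
    "real 2m11.220s", "real 2m9.941s",
]
NEWMM_REAL = [
    "real 7m52.180s", "real 7m58.090s", "real 7m57.071s", "real 8m9.779s",
    "real 7m54.576s", "real 7m52.807s", "real 7m59.109s", "real 7m58.489s",
    "real 7m59.604s", "real 7m57.844s",
]


def parse_real_seconds(line: str) -> float:
    """Convert one 'real 2m10.923s' line to seconds."""
    match = REAL_LINE.search(line)
    if match is None:
        raise ValueError(f"not a 'real' timing line: {line!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def average_real(lines) -> float:
    """Average wall-clock time, in seconds, over all timing lines."""
    return mean(parse_real_seconds(line) for line in lines)


def slowdown(slow_lines, fast_lines) -> float:
    """How many times longer the slow run took, on average."""
    return average_real(slow_lines) / average_real(fast_lines)


if __name__ == "__main__":
    print(f"nlpo3 average: {average_real(NLPO3_REAL):.4f} s")
    print(f"newmm average: {average_real(NEWMM_REAL):.4f} s")
    print(f"ratio: {slowdown(NEWMM_REAL, NLPO3_REAL):.2f}")  # prints "ratio: 3.66"
```

Reporting the figure as a ratio of average wall-clock times (Newmm time divided by nlpo3 time) states exactly what was compared, which a bare "2X faster" does not.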
