cisnlp/GlotScript

GlotScript

Detect the script (writing system) of text based on ISO 15924.

Special codes

  • Zinh (Inherited) is the Unicode script property value for characters that may be used with multiple scripts and that inherit their script from the preceding base character. In some cases, we opted to move parts of the Zinh code (e.g. ARABIC FATHATAN..ARABIC HAMZA BELOW, ARABIC LETTER SUPERSCRIPT ALEF) into a different block.
  • Zyyy (Common) is the Unicode script property value for "Common" characters, such as punctuation and symbols shared across scripts.
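
To make the "inherit their script from a preceding base character" rule concrete, here is a minimal sketch in plain Python. It is not GlotScript's implementation: `SCRIPT_RANGES`, `base_script`, and `resolve_scripts` are hypothetical names, the range table is a deliberately tiny toy, and a non-zero canonical combining class is used only as a rough stand-in for the Zinh property.

```python
import unicodedata

# Hypothetical toy table (illustration only): a few code point ranges
# and the ISO 15924 code assumed for each.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latn"),  # A-Z
    (0x0061, 0x007A, "Latn"),  # a-z
    (0x0620, 0x063F, "Arab"),  # Arabic letters before TATWEEL
    (0x0641, 0x064A, "Arab"),  # Arabic letters after TATWEEL
]


def base_script(ch):
    """Look up a character in the toy table.

    Simplification: anything the table does not cover is lumped into Zyyy.
    """
    cp = ord(ch)
    for start, end, script in SCRIPT_RANGES:
        if start <= cp <= end:
            return script
    return "Zyyy"


def resolve_scripts(text):
    """Pair each character with a script code.

    Combining marks (approximating Zinh) take the script of the
    preceding base character instead of a script of their own.
    """
    resolved = []
    previous = "Zyyy"  # nothing to inherit from at the start of the text
    for ch in text:
        if unicodedata.combining(ch):
            script = previous
        else:
            script = base_script(ch)
            previous = script
        resolved.append((ch, script))
    return resolved
```

With this sketch, `resolve_scripts("e\u0301")` assigns Latn to both the letter and the combining acute accent, and `resolve_scripts("\u0628\u064B")` assigns Arab to both BEH and ARABIC FATHATAN.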

Install

pip3 install GlotScript@git+https://github.com/cisnlp/GlotScript

Usage

from GlotScript import get_script_predictor
sp = get_script_predictor()

sp('これは日本人です')  # ('Hira', 0.625, {'details': {'Hira': 0.625, 'Hani': 0.375}, 'tie': False, 'interval': 0.25})
sp('This is Latin')[:2]  # ('Latn', 1.0)
sp('මේක සිංහල')[0]  # 'Sinh'
sp('𝄞𝄫 𒊕𒀸')  # ('Xsux', 0.5, {'details': {'Xsux': 0.5, 'Zyyy': 0.5}, 'tie': True, 'interval': 0.0})
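
The returned tuple can drive a simple script filter. The sketch below assumes only the `(script, score, info)` shape shown in the outputs above. `keep_line`, `filter_by_script`, `fake_predictor`, and the `min_score` threshold are hypothetical names chosen for illustration, not part of GlotScript. The stub predictor replays two results from this README so the example runs without the library installed.

```python
def keep_line(prediction, expected_script, min_score=0.6):
    """Accept a prediction only if it is untied, matches, and scores high enough."""
    script, score, info = prediction
    if info.get("tie", False):
        # Two scripts share the top score, so the label is ambiguous.
        return False
    return script == expected_script and score >= min_score


def filter_by_script(lines, predictor, expected_script, min_score=0.6):
    """Return the lines whose predicted script passes keep_line."""
    return [
        line
        for line in lines
        if keep_line(predictor(line), expected_script, min_score)
    ]


# Stub standing in for the real predictor; values copied from the README.
CANNED = {
    "これは日本人です": (
        "Hira",
        0.625,
        {"details": {"Hira": 0.625, "Hani": 0.375}, "tie": False, "interval": 0.25},
    ),
    "𝄞𝄫 𒊕𒀸": (
        "Xsux",
        0.5,
        {"details": {"Xsux": 0.5, "Zyyy": 0.5}, "tie": True, "interval": 0.0},
    ),
}


def fake_predictor(text):
    return CANNED[text]


kept = filter_by_script(list(CANNED), fake_predictor, "Hira")
```

With the library installed, the object returned by `get_script_predictor()` could presumably be passed in place of `fake_predictor`, since it is called the same way in the usage lines above.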

Citation

If you use any part of this library in your research, please cite it using the following BibTeX entry.

@misc{glotscript,
  author = {Kargaran, Amir Hossein and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
  title = {GlotScript},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub Repository},
  howpublished = {\url{https://github.com/cisnlp/GlotScript}},
}

Exploring Unicode Blocks: Related Sources
