cisnlp/GlotScript

GlotScript

  • GlotScript-Resource: provides a resource displaying the writing systems for various languages.

  • GlotScript-Tool: determines the script (writing system) of input text using ISO 15924.

Resource

What writing system is each language written in?

Example:

Language | CORE | AUXILIARY
Turkish (tur) | Latn | Arab, Cyrl, Grek
Thai (tha) | Thai | Latn
Vietnamese (vie) | Latn | Hani

See the metadata folder for more languages.
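
A lookup over this resource could be sketched as follows. The dictionary copies only the three example rows above; the names `EXAMPLE_SCRIPTS` and `script_role` are hypothetical and not part of GlotScript, and the full data lives in the metadata folder.

```python
# Hypothetical illustration: only the three rows from the example table above.
EXAMPLE_SCRIPTS = {
    "tur": {"core": ["Latn"], "auxiliary": ["Arab", "Cyrl", "Grek"]},
    "tha": {"core": ["Thai"], "auxiliary": ["Latn"]},
    "vie": {"core": ["Latn"], "auxiliary": ["Hani"]},
}


def script_role(lang, script):
    """Return 'core', 'auxiliary', or None for a language code and an ISO 15924 script code."""
    entry = EXAMPLE_SCRIPTS.get(lang)
    if entry is None:
        return None  # language not in this small example table
    for role in ("core", "auxiliary"):
        if script in entry[role]:
            return role
    return None  # script not listed for this language


print(script_role("tur", "Latn"))  # core
print(script_role("vie", "Hani"))  # auxiliary
```

A check like this is one way to flag text whose detected script is unexpected for its claimed language.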

Tool

It's a Python library that detects the script (writing system) of text based on ISO 15924.

Special codes

  • Zinh code is the Unicode script property value of characters that may be used with multiple scripts, and that inherit their script from a preceding base character. In some cases, we opted to integrate parts of the Zinh code (e.g. ARABIC FATHATAN..ARABIC HAMZA BELOW, ARABIC LETTER SUPERSCRIPT ALEF) into a different block.
  • Zyyy code is the Unicode script for "Common" characters.
  • Zzzz code is the code for an "uncoded" script.
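
If a dictionary of script shares contains these special codes and only the real scripts matter, one option is to drop them and rescale the rest. This is a minimal sketch under that assumption; `SPECIAL_CODES` and `drop_special_codes` are hypothetical helpers, not part of GlotScript, and the input values are made up for illustration.

```python
# The three special codes described above.
SPECIAL_CODES = {"Zinh", "Zyyy", "Zzzz"}


def drop_special_codes(details):
    """Remove special codes from a {script: share} dict and rescale the rest to sum to 1."""
    kept = {script: share for script, share in details.items()
            if script not in SPECIAL_CODES}
    total = sum(kept.values())
    if total == 0:
        return {}  # nothing but special codes (or an empty input)
    return {script: share / total for script, share in kept.items()}


# Made-up shares for illustration, not real library output.
print(drop_special_codes({"Latn": 0.6, "Zyyy": 0.4}))  # {'Latn': 1.0}
```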

Install

from pip

pip3 install GlotScript

from git

pip3 install GlotScript@git+https://github.com/cisnlp/GlotScript

Usage

Script Detection

from GlotScript import sp

sp('これは日本人です')
# ('Hira', 0.625, {'details': {'Hira': 0.625, 'Hani': 0.375}, 'tie': False, 'interval': 0.25})

sp('This is Latin')[:2]
# ('Latn', 1.0)

sp('මේක සිංහල')[0]
# 'Sinh'
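
The shares in the first example can be read as character proportions: 5 of the 8 characters are Hiragana (0.625) and 3 are Han (0.375). The sketch below reproduces those numbers with the standard library only. It is not GlotScript's implementation: it uses the first word of each character's Unicode name as a rough stand-in for the script, so the labels (`HIRAGANA`, `CJK`) differ from ISO 15924 codes, and `script_shares` is a hypothetical name.

```python
import unicodedata
from collections import Counter


def script_shares(text):
    """Approximate per-script shares using the first word of each letter's Unicode name."""
    counts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()  # skip spaces, digits, and punctuation
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


print(script_shares("これは日本人です"))  # {'HIRAGANA': 0.625, 'CJK': 0.375}
```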

Script Separation

from GlotScript import sc

sent = "Hello Salut سلام 你好 こんにちは שלום مرحبا"
sc(sent)
# {
#   "Latn": "Hello Salut ",
#   "Hebr": " שלום ",
#   "Arab": " سلام مرحبا",
#   "Hani": " 你好 ",
#   "Hira": " こんにちは "
# }
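
The segments in this output keep their surrounding spaces. If that whitespace is unwanted, a small post-processing step can strip it. This is a sketch that assumes only the dictionary shape shown above; `clean_segments` is a hypothetical helper, not part of GlotScript.

```python
def clean_segments(segments):
    """Strip surrounding whitespace from each segment and drop segments left empty."""
    cleaned = {}
    for script, text in segments.items():
        stripped = text.strip()
        if stripped:
            cleaned[script] = stripped
    return cleaned


# Two entries copied from the example output above.
print(clean_segments({"Latn": "Hello Salut ", "Hani": " 你好 "}))
# {'Latn': 'Hello Salut', 'Hani': '你好'}
```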

Exploring Unicode Blocks: Related Sources

Citation

If you use any part of our resource or tool in your research, please cite it using the following BibTeX entry.

@article{kargaran2023glotscript,
  title={GlotScript: A Resource and Tool for Low Resource Writing System Identification},
  author={Kargaran, Amir Hossein and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
  journal={arXiv preprint arXiv:2309.13320},
  year={2023}
}