cisnlp/GlotScript

GlotScript

  • GlotScript-Resource: provides a resource displaying the writing systems for various languages.

  • GlotScript-Tool: determines the script (writing system) of input text using ISO 15924.

Resource

What writing system is each language written in?

Example:

Language          CORE   AUXILIARY
Turkish (tur)     Latn   Arab, Cyrl, Grek
Thai (tha)        Thai   Latn
Vietnamese (vie)  Latn   Hani

See the metadata folder for more languages.
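As a rough illustration of how the CORE/AUXILIARY distinction might be used, the sketch below hard-codes the three example rows from the table and classifies a script code for a given language. The dictionary layout and the `script_status` helper are hypothetical. The actual file format in the metadata folder is not shown here and may differ.

```python
# Hypothetical in-memory copy of the three example rows above.
# The real metadata files may be organized differently.
LANGUAGE_SCRIPTS = {
    "tur": {"core": {"Latn"}, "auxiliary": {"Arab", "Cyrl", "Grek"}},
    "tha": {"core": {"Thai"}, "auxiliary": {"Latn"}},
    "vie": {"core": {"Latn"}, "auxiliary": {"Hani"}},
}


def script_status(lang, script):
    """Classify an ISO 15924 script code for an ISO 639-3 language code."""
    entry = LANGUAGE_SCRIPTS.get(lang)
    if entry is None:
        return "unknown language"
    if script in entry["core"]:
        return "core"
    if script in entry["auxiliary"]:
        return "auxiliary"
    return "unexpected"


print(script_status("tur", "Latn"))  # core
print(script_status("tur", "Cyrl"))  # auxiliary
print(script_status("tha", "Arab"))  # unexpected
```

A check like this could be combined with a detected script to flag text whose writing system does not match its language label.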

Tool

GlotScript-Tool is a Python library that detects the script (writing system) of a text and reports it as an ISO 15924 code.

Special codes

  • Zinh is the code for the Unicode script property value "Inherited": characters that may be used with multiple scripts and that inherit their script from the preceding base character. In some cases, we opted to integrate parts of the Zinh range (e.g. ARABIC FATHATAN..ARABIC HAMZA BELOW, ARABIC LETTER SUPERSCRIPT ALEF) into a different block.
  • Zyyy is the code for the Unicode script property value "Common".
  • Zzzz is the code for an "uncoded" script.
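To make the Zinh reassignment concrete, the sketch below looks up the characters named in the example with Python's standard `unicodedata` module and applies a small override table. The table, the `reassigned_script` helper, and the choice of `Arab` as the target are assumptions for illustration. The text above only says these characters were moved to "a different block", and the library's internal tables are not shown here.

```python
import unicodedata

# Assumed override table built from the example above:
# U+064B..U+0655 (ARABIC FATHATAN..ARABIC HAMZA BELOW) and
# U+0670 (ARABIC LETTER SUPERSCRIPT ALEF).
# Mapping them to "Arab" is an assumption, not a documented fact.
ASSUMED_OVERRIDES = [
    (0x064B, 0x0655, "Arab"),
    (0x0670, 0x0670, "Arab"),
]


def reassigned_script(ch):
    """Return the assumed override script for ch, or None if not overridden."""
    cp = ord(ch)
    for lo, hi, script in ASSUMED_OVERRIDES:
        if lo <= cp <= hi:
            return script
    return None


for cp in (0x064B, 0x0655, 0x0670):
    ch = chr(cp)
    # unicodedata exposes character names but not the Script property.
    print(f"U+{cp:04X}", unicodedata.name(ch), reassigned_script(ch))
```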

Install

from pip

pip3 install GlotScript

from git

pip3 install GlotScript@git+https://github.com/cisnlp/GlotScript

Usage

Script Detection

from GlotScript import sp

sp('これは日本人です')
# ('Hira', 0.625, {'details': {'Hira': 0.625, 'Hani': 0.375}, 'tie': False, 'interval': 0.25})

sp('This is Latin')[:2]
# ('Latn', 1.0)

sp('මේක සිංහල')[0]
# 'Sinh'
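The scores in the first example are consistent with simple per-character counting: of the eight characters in the sentence, five are Hiragana (5/8 = 0.625) and three are Han (3/8 = 0.375), and the reported interval of 0.25 matches the gap between the two. The sketch below reproduces that arithmetic with a tiny hand-written range table. It is not the library's implementation: `TOY_RANGES`, `toy_script_of`, and `toy_script_shares` are hypothetical names, and the table covers only a few Unicode ranges.

```python
from collections import Counter

# Tiny illustrative range table (not the library's data).
TOY_RANGES = {
    "Hira": [(0x3040, 0x309F)],                    # Hiragana block
    "Hani": [(0x4E00, 0x9FFF)],                    # CJK Unified Ideographs
    "Latn": [(0x0041, 0x005A), (0x0061, 0x007A)],  # Basic Latin letters
}


def toy_script_of(ch):
    """Return a script code for ch, or None if it is outside the toy table."""
    cp = ord(ch)
    for script, ranges in TOY_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return script
    return None


def toy_script_shares(text):
    """Share of characters per script, ignoring characters not in the table."""
    counts = Counter(s for s in map(toy_script_of, text) if s is not None)
    total = sum(counts.values())
    return {script: n / total for script, n in counts.items()} if total else {}


shares = toy_script_shares('これは日本人です')
print(shares)  # {'Hira': 0.625, 'Hani': 0.375}
```

In this toy version spaces fall outside the table and are skipped, which is why a sentence such as 'This is Latin' would come out as entirely Latn.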

Script Separation

from GlotScript import sc

sent = "Hello Salut سلام 你好 こんにちは שלום مرحبا"
sc(sent)
# {"Latn": "Hello Salut ", "Hebr": " שלום ", "Arab": " سلام مرحبا", "Hani": " 你好 ", "Hira": " こんにちは "}
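The separated segments keep their surrounding whitespace, so a small post-processing step can be handy. The sketch below takes a dictionary shaped like the output shown above, copied here as a literal so the example runs without the library installed, and splits each segment into tokens. `tokens_by_script` is a hypothetical helper, not part of GlotScript.

```python
def tokens_by_script(separated):
    """Split each per-script segment on whitespace, dropping empty segments."""
    return {
        script: segment.split()
        for script, segment in separated.items()
        if segment.strip()
    }


# Literal copy of the example output above.
separated = {
    "Latn": "Hello Salut ",
    "Hebr": " שלום ",
    "Arab": " سلام مرحبا",
    "Hani": " 你好 ",
    "Hira": " こんにちは ",
}

tokens = tokens_by_script(separated)
print(tokens["Latn"])       # ['Hello', 'Salut']
print(len(tokens["Arab"]))  # 2
```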

Exploring Unicode Blocks: Related Sources

Citation

If you use any part of our resource or tool in your research, please cite it using the following BibTeX entry.

@inproceedings{kargaran-etal-2024-glotscript-resource,
  title = "{G}lot{S}cript: A Resource and Tool for Low Resource Writing System Identification",
  author = {Kargaran, Amir Hossein and
    Yvon, Fran{\c{c}}ois and
    Sch{\"u}tze, Hinrich},
  editor = "Calzolari, Nicoletta and
    Kan, Min-Yen and
    Hoste, Veronique and
    Lenci, Alessandro and
    Sakti, Sakriani and
    Xue, Nianwen",
  booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
  month = may,
  year = "2024",
  address = "Torino, Italia",
  publisher = "ELRA and ICCL",
  url = "https://aclanthology.org/2024.lrec-main.687",
  pages = "7774--7784"
}