GlotScript

GlotScript-Resource: a resource listing the writing systems used by various languages.
GlotScript-Tool: determines the script (writing system) of input text using ISO 15924.

Resource

What writing system is each language written in?

Example:

Language	CORE	AUXILIARY
Turkish (tur)	Latn	Arab, Cyrl, Grek
Thai (tha)	Thai	Latn
Vietnamese (vie)	Latn	Hani

See metadata folder for more languages.
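
As a minimal sketch of how the table might be used, the three example rows can be held in a dictionary and queried for whether a script is core or auxiliary for a language. The `SCRIPTS` dictionary and the `script_role` helper are hypothetical names for illustration only; they are not part of GlotScript, and the files in the metadata folder may be laid out differently.

```python
# Hypothetical in-memory copy of the three example rows above
# (not the actual layout of the metadata folder).
SCRIPTS = {
    "tur": {"core": ["Latn"], "auxiliary": ["Arab", "Cyrl", "Grek"]},
    "tha": {"core": ["Thai"], "auxiliary": ["Latn"]},
    "vie": {"core": ["Latn"], "auxiliary": ["Hani"]},
}


def script_role(lang, script):
    """Return 'core', 'auxiliary', or None for an ISO 639-3 / ISO 15924 pair."""
    entry = SCRIPTS.get(lang)
    if entry is None:
        return None
    for role in ("core", "auxiliary"):
        if script in entry[role]:
            return role
    return None


print(script_role("tur", "Latn"))
print(script_role("vie", "Hani"))
print(script_role("tha", "Cyrl"))
```

A lookup like this could be combined with a detected script to flag text whose writing system is unexpected for its claimed language.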

Tool

GlotScript-Tool is a Python library that detects the script (writing system) of a text and reports it as an ISO 15924 code.

Unicode version: 15.0.0
The codes were sourced from Wikipedia ISO_15924.
Unicode ranges were extracted from Unicode Character Database.

Special codes

Zinh code is the Unicode script property value of characters that may be used with multiple scripts, and that inherit their script from a preceding base character. In some cases, we opted to integrate parts of the Zinh code (e.g. ARABIC FATHATAN..ARABIC HAMZA BELOW, ARABIC LETTER SUPERSCRIPT ALEF) into a different block.
Zyyy code is the Unicode script for "Common" characters.
Zzzz code is the code for an "uncoded" script, used for characters that belong to no coded script.
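
The Zinh examples named above can be inspected with Python's standard library. This is only a sketch: `unicodedata` does not expose the Unicode Script property, so the character name and general category are used as a proxy, and the code point range U+064B..U+0655 plus U+0670 is derived here from the character names quoted above. Which block GlotScript actually assigns these marks to is not shown.

```python
import unicodedata

# Code points for ARABIC FATHATAN..ARABIC HAMZA BELOW and
# ARABIC LETTER SUPERSCRIPT ALEF (derived from the names quoted above).
code_points = list(range(0x064B, 0x0655 + 1)) + [0x0670]

marks = [
    (f"U+{cp:04X}", unicodedata.name(chr(cp)), unicodedata.category(chr(cp)))
    for cp in code_points
]

# Each entry is a nonspacing mark (category Mn) with an ARABIC name, even
# though its Script property is Inherited rather than Arabic.
for hex_cp, name, category in marks:
    print(hex_cp, name, category)
```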

Install

from pip

pip3 install GlotScript

from git

pip3 install GlotScript@git+https://github.com/cisnlp/GlotScript

Usage

Script Detection

from GlotScript import sp

sp('これは日本人です') >> ('Hira', 0.625, {'details': {'Hira': 0.625, 'Hani': 0.375}, 'tie': False, 'interval': 0.25})

sp('This is Latin')[:1] >> ('Latn', 1.0)

sp('මේක සිංහල')[0] >> 'Sinh'
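
The numbers in the first example are consistent with per-character proportions: five of the eight characters are Hiragana (0.625), three are Han (0.375), and the interval is the gap between the two shares. The sketch below reproduces that arithmetic with a toy two-script range table. It is an illustration under those assumptions, not GlotScript's implementation; `TOY_RANGES`, `toy_script_of`, and `toy_sp` are hypothetical names, the ranges are approximate block ranges, and the behaviour for single-script or empty input is a guess.

```python
from collections import Counter

# Toy range table covering only the two scripts in the example (assumption).
TOY_RANGES = {
    "Hira": [(0x3040, 0x309F)],  # Hiragana block
    "Hani": [(0x4E00, 0x9FFF)],  # CJK Unified Ideographs block
}


def toy_script_of(ch):
    """Return the toy script code for a character, or None if not covered."""
    cp = ord(ch)
    for script, ranges in TOY_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return script
    return None


def toy_sp(text):
    """Return (top script, its share, details) from per-character counts."""
    counts = Counter(s for s in map(toy_script_of, text) if s is not None)
    total = sum(counts.values())
    if total == 0:
        return None, 0.0, {"details": {}, "tie": False, "interval": 0.0}
    # most_common() orders scripts from largest to smallest share.
    details = {s: n / total for s, n in counts.most_common()}
    ranked = list(details.items())
    top_script, top_share = ranked[0]
    second_share = ranked[1][1] if len(ranked) > 1 else 0.0
    interval = top_share - second_share
    info = {"details": details, "tie": interval == 0, "interval": interval}
    return top_script, top_share, info


print(toy_sp("これは日本人です"))
```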

Script Separation

from GlotScript import sc

sent = "Hello Salut سلام 你好 こんにちは שלום مرحبا"
sc(sent) >> { "Latn":"Hello Salut ", "Hebr":" שלום ", "Arab":" سلام مرحبا", "Hani":" 你好 ", "Hira":" こんにちは " }
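
The per-script strings in the output above keep their surrounding spaces. If trimmed text is wanted, a small post-processing step might look like the sketch below. The dictionary is copied from the example output rather than produced by calling the library, and `tidy` is a hypothetical helper, not part of GlotScript.

```python
# Dictionary copied from the sc() example output above.
separated = {
    "Latn": "Hello Salut ",
    "Hebr": " שלום ",
    "Arab": " سلام مرحبا",
    "Hani": " 你好 ",
    "Hira": " こんにちは ",
}


def tidy(parts):
    """Collapse runs of whitespace and trim each per-script string."""
    return {script: " ".join(text.split()) for script, text in parts.items()}


cleaned = tidy(separated)
for script, text in cleaned.items():
    print(script, repr(text))
```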

Exploring Unicode Blocks: Related Sources

Citation

If you use any part of this resource or tool in your research, please cite it using the following BibTeX entry.

@inproceedings{kargaran-etal-2024-glotscript-resource,
  title = "{G}lot{S}cript: A Resource and Tool for Low Resource Writing System Identification",
  author = {Kargaran, Amir Hossein and
    Yvon, Fran{\c{c}}ois and
    Sch{\"u}tze, Hinrich},
  editor = "Calzolari, Nicoletta and
    Kan, Min-Yen and
    Hoste, Veronique and
    Lenci, Alessandro and
    Sakti, Sakriani and
    Xue, Nianwen",
  booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
  month = may,
  year = "2024",
  address = "Torino, Italia",
  publisher = "ELRA and ICCL",
  url = "https://aclanthology.org/2024.lrec-main.687",
  pages = "7774--7784"
}
