- GlotScript-Tool: determines the script (writing system) of input text using ISO 15924.
- GlotScript-Resource: provides a resource displaying the writing systems for various languages.
GlotScript-Resource: What writing system is each language written in? See the metadata folder.

GlotScript-Tool: Detect the script (writing system) of text based on ISO 15924.
- Unicode version: 15.0.0
- The script codes were sourced from the Wikipedia ISO_15924 article.
- The Unicode ranges were extracted from the Unicode Character Database.
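As an illustration of the range extraction mentioned above, here is a minimal sketch that parses a few lines in the format of the Unicode Character Database file `Scripts.txt`. It is not GlotScript's own extraction code: the function `parse_ranges`, the `SAMPLE` excerpt, and the hand-written `NAME_TO_CODE` mapping from Unicode script names to ISO 15924 codes are all assumptions made for this example.

```python
import re

# A short excerpt in the Scripts.txt format: "START[..END] ; ScriptName # comment".
SAMPLE = """\
# Lines starting with '#' are comments.
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
3041..3096    ; Hiragana # Lo  [86] HIRAGANA LETTER SMALL A..HIRAGANA LETTER SMALL KE
"""

# Hypothetical, hand-written mapping; a full table would cover every script name.
NAME_TO_CODE = {"Latin": "Latn", "Hiragana": "Hira"}

# One code point, optionally followed by "..END", then "; ScriptName".
LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_ranges(text):
    """Return a list of (start, end, iso_code) tuples, one per data line."""
    ranges = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # comment or blank line
        start_hex, end_hex, name = match.groups()
        start = int(start_hex, 16)
        end = int(end_hex, 16) if end_hex else start  # single code point
        ranges.append((start, end, NAME_TO_CODE[name]))
    return ranges


for start, end, code in parse_ranges(SAMPLE):
    print(f"U+{start:04X}..U+{end:04X} -> {code}")
```

Keeping the result as inclusive `(start, end, code)` tuples makes it easy to look up a character later with `start <= ord(ch) <= end`.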
Zinh code is the Unicode script property value of characters that may be used with multiple scripts and that inherit their script from a preceding base character. In some cases, we opted to integrate parts of the Zinh code (e.g. ARABIC FATHATAN..ARABIC HAMZA BELOW, ARABIC LETTER SUPERSCRIPT ALEF) into a different block.

Zyyy code is the Unicode script for "Common" characters.

Zzzz code is the Unicode script for "uncoded" script.
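To make the "inherit their script from a preceding base character" rule concrete, the sketch below resolves Zinh characters at lookup time. It is only an illustration under stated assumptions: `TOY_RANGES` is a tiny hand-written table, `raw_script` and `resolve_inherited` are hypothetical helpers, and treating every character missing from the toy table as Zzzz is a simplification. GlotScript itself, as described above, instead assigns some Zinh ranges to a different block ahead of time.

```python
# Tiny hand-written table for illustration only; a real table would be built
# from the Unicode Character Database.
TOY_RANGES = [
    (0x0020, 0x0020, "Zyyy"),  # SPACE (Common)
    (0x0041, 0x005A, "Latn"),  # A-Z
    (0x0061, 0x007A, "Latn"),  # a-z
    (0x0301, 0x0301, "Zinh"),  # COMBINING ACUTE ACCENT (Inherited)
    (0x0628, 0x0628, "Arab"),  # ARABIC LETTER BEH
    (0x064E, 0x064E, "Zinh"),  # ARABIC FATHA (Inherited)
]


def raw_script(ch):
    """Look up the script code of one character in the toy table."""
    cp = ord(ch)
    for start, end, code in TOY_RANGES:
        if start <= cp <= end:
            return code
    return "Zzzz"  # simplification: anything outside the toy table


def resolve_inherited(text):
    """Replace each Zinh label with the script of the preceding character."""
    resolved = []
    last_base = None  # script of the most recent non-Zinh character
    for ch in text:
        code = raw_script(ch)
        if code == "Zinh":
            if last_base is not None:
                code = last_base  # inherit from the preceding base
        else:
            last_base = code
        resolved.append(code)
    return resolved


print(resolve_inherited("e\u0301"))       # ['Latn', 'Latn']
print(resolve_inherited("\u0628\u064E"))  # ['Arab', 'Arab']
```

A Zinh character with no preceding base keeps its Zinh label in this sketch, since there is nothing to inherit from.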
Install from pip:

    pip3 install GlotScript

Install from git:

    pip3 install GlotScript@git+https://github.com/cisnlp/GlotScript

Load the script predictor:

    from GlotScript import get_script_predictor
    sp = get_script_predictor()

OR

    from GlotScript import sp

Identify the script of a text:

    sp('これは日本人です')
    >> ('Hira', 0.625, {'details': {'Hira': 0.625, 'Hani': 0.375}, 'tie': False, 'interval': 0.25})

    sp('This is Latin')[:1]
    >> ('Latn', 1.0)

    sp('මේක සිංහල')[0]
    >> 'Sinh'

    sp('𝄞𝄫 𒊕𒀸')
    >> ('Xsux', 0.5, {'details': {'Xsux': 0.5, 'Zyyy': 0.5}, 'tie': True, 'interval': 0.0})

Separate a text by script:

    from GlotScript import separate_script
    sent = "Hello Salut سلام 你好 こんにちは שלום مرحبا"
    separate_script(sent)
    >> {
        "Latn":"Hello Salut ",
        "Hebr":" שלום ",
        "Arab":" سلام مرحبا",
        "Hani":" 你好 ",
        "Hira":" こんにちは "
    }
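The result tuples above can be read as (top script, its share of the characters, details). The following sketch shows one way such a tuple could be computed from per-character script labels. It is an assumption inferred from the sample outputs, not GlotScript's actual implementation: `summarize` is a hypothetical helper, it expects labels for non-whitespace characters only (the samples are consistent with whitespace being ignored), and how it picks a winner on a tie (first label seen) is a guess.

```python
from collections import Counter


def summarize(labels):
    """Turn a list of per-character script codes into a result tuple."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return None  # nothing to score
    # Most frequent first; equal counts keep first-seen order.
    ranked = counts.most_common()
    details = {code: n / total for code, n in ranked}
    top_code, top_count = ranked[0]
    top_share = top_count / total
    second_share = ranked[1][1] / total if len(ranked) > 1 else 0.0
    interval = top_share - second_share  # gap between the two best scripts
    return (top_code, top_share,
            {"details": details, "tie": interval == 0.0, "interval": interval})


# 'これは日本人です' contains 5 Hiragana and 3 Han characters.
print(summarize(["Hira"] * 3 + ["Hani"] * 3 + ["Hira"] * 2))
# ('Hira', 0.625, {'details': {'Hira': 0.625, 'Hani': 0.375}, 'tie': False, 'interval': 0.25})
```

Under this reading, `interval` is the gap between the best and second-best script, so a small value signals mixed-script text and `tie` is true when the gap is zero.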
- List of Unicode characters - Wikipedia
- Lightweight Plain-Text Editor for macOS - CotEditor
- The Cygwin Terminal – terminal emulator for Cygwin, MSYS, and WSL - mintty
- ISO_15924 Wikipedia
- Unicode Character Database (Blocks) - Unicode
- Unicode Character Database (Scripts) - Unicode
- A free, web-based font editor, focusing on font design hobbyists. - Glyphr-Studio-1
- Kotlin - JetBrains
- UNIX-like reverse engineering framework and command-line toolset - radare2
- FreeOrion Game
- DOMinator - Firefox
- SHSans-derived CJK font family - glow-sans
- Unicode Subset Bitfields - Microsoft
- Stops - FAIR NLLB FB
- Gradient Boosting on Decision Trees - catboost
- Blender
- Unicode Wikipedia
If you use any part of this library in your research, please cite it using the following BibTeX entry.
    @article{kargaran2023glotscript,
      title = {GlotScript: A Resource and Tool for Low Resource Writing System Identification},
      author = {Kargaran, Amir Hossein and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
      year = 2023,
      journal = {arXiv preprint arXiv:2309.13320}
    }