Posted on Mar 8, 2024

Tokenization Technique for Non-Spaced Words

In Natural Language Processing, identifying the words in a sentence written in English or another Latin-script language is not too hard, because words are separated by spaces. Scripts such as Khmer are different: words are written without spaces between them, so we need to compare the text against known words from a dictionary.
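
To see why splitting on spaces is not enough, here is a small Python sketch. The Khmer sentence is only an illustrative sample assembled from words that appear in this post's dictionary, and the variable names are my own.

```python
english = "I have many books here"
# Illustrative sample: Khmer dictionary words written back to back, no spaces.
khmer = "មានសៀវភៅច្រើនណាស់នៅទីនេះ"

# str.split() with no arguments splits on runs of whitespace.
english_tokens = english.split()
khmer_tokens = khmer.split()

print(len(english_tokens))  # 5 words found
print(len(khmer_tokens))    # 1: the whole sentence comes back as a single "word"
```

Because the Khmer string contains no whitespace, the split has nothing to cut on, which is why a dictionary lookup is needed.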

Dictionary Format:

You can structure your dictionary so that each word stores its part of speech, a list of related phrases, and a short explanation. Here's an example:

khmer_dictionary = {
    'មាន': {'POS': 'Verb', 'Related': ['មានសៀវភៅ', 'មានទិន្នន័យ'], 'Explanation': 'to have'},
    'សៀវភៅ': {'POS': 'Noun', 'Related': [], 'Explanation': 'book'},
    'ច្រើន': {'POS': 'Adjective', 'Related': [], 'Explanation': 'many'},
    'ណាស់': {'POS': 'Adverb', 'Related': [], 'Explanation': 'very'},
    'នៅ': {'POS': 'Verb', 'Related': [], 'Explanation': 'to be at'},
    'ទីនេះ': {'POS': 'Noun', 'Related': [], 'Explanation': 'this place'},
}

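Improving the Tokenization Method:

To handle multi-word phrases and out-of-vocabulary (OOV) words better, the tokenization function should prefer the longest dictionary entry at each position and mark unmatched text as OOV, instead of letting one unknown character swallow the rest of the sentence. Here's a revised version that uses greedy longest matching. Multi-word phrases are matched only if they are added as dictionary keys.

def tokenize_with_dictionary(sentence):
    """Greedy longest-match tokenizer over khmer_dictionary."""
    tokens = []
    # The longest dictionary key bounds how far ahead we need to look.
    max_len = max(map(len, khmer_dictionary), default=1)
    i = 0
    while i < len(sentence):
        # Try the longest candidate first so longer entries win over their prefixes.
        for end in range(min(len(sentence), i + max_len), i, -1):
            word = sentence[i:end]
            if word in khmer_dictionary:
                tokens.append((word, khmer_dictionary[word]))
                i = end
                break
        else:
            # No entry starts here: grow the previous OOV run or start a new one.
            if tokens and tokens[-1][1] == 'OOV':
                tokens[-1] = (tokens[-1][0] + sentence[i], 'OOV')
            else:
                tokens.append((sentence[i], 'OOV'))
            i += 1
    return tokens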
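
The 'Related' lists in the dictionary could also be used after tokenization to join neighbouring tokens into a known phrase. The following is only a rough sketch of that idea: `merge_related_phrases` is a hypothetical helper, not part of the tokenizer above, and keeping the first word's entry as the phrase's info is my own assumption.

```python
def merge_related_phrases(tokens):
    """Join adjacent tokens whose concatenation is listed in the first token's 'Related'."""
    merged = []
    i = 0
    while i < len(tokens):
        word, info = tokens[i]
        # OOV tokens carry the string 'OOV' instead of a dict, so skip them here.
        if i + 1 < len(tokens) and isinstance(info, dict):
            phrase = word + tokens[i + 1][0]
            if phrase in info.get('Related', []):
                # Assumption: the phrase inherits the first word's dictionary entry.
                merged.append((phrase, info))
                i += 2
                continue
        merged.append((word, info))
        i += 1
    return merged

# Hand-written sample in the (word, info) shape the tokenizer returns.
sample = [
    ('មាន', {'POS': 'Verb', 'Related': ['មានសៀវភៅ', 'មានទិន្នន័យ'], 'Explanation': 'to have'}),
    ('សៀវភៅ', {'POS': 'Noun', 'Related': [], 'Explanation': 'book'}),
    ('ច្រើន', {'POS': 'Adjective', 'Related': [], 'Explanation': 'many'}),
]
phrases = merge_related_phrases(sample)
print([word for word, _ in phrases])
```

With this sample, the first two tokens are joined into one phrase and the third is left unchanged.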

Then you can save the tokens to a database.
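
As one possible way to do that, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the `save_tokens` helper, and the in-memory database are all assumptions made for illustration; adapt them to whatever database you actually use.

```python
import json
import sqlite3

def save_tokens(conn, sentence, tokens):
    """Store one row per token, keeping its position in the sentence."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tokens ("
        " sentence TEXT, position INTEGER, token TEXT, info TEXT)"
    )
    rows = [
        # Dictionary entries are dicts, so serialize them as JSON;
        # the 'OOV' marker is stored as a JSON string.
        (sentence, pos, word, json.dumps(info, ensure_ascii=False))
        for pos, (word, info) in enumerate(tokens)
    ]
    conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?)", rows)
    conn.commit()

# Hand-written sample in the (word, info) shape the tokenizer returns.
sample_tokens = [
    ('សៀវភៅ', {'POS': 'Noun', 'Related': [], 'Explanation': 'book'}),
    ('x', 'OOV'),
]
conn = sqlite3.connect(':memory:')  # in-memory database, just for the demo
save_tokens(conn, 'សៀវភៅx', sample_tokens)
print(conn.execute("SELECT position, token FROM tokens ORDER BY position").fetchall())
```

Storing the position lets you rebuild the sentence in order later, and keeping the entry as JSON avoids fixing a column per dictionary field too early.
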
If you have a better idea or a suggestion for improvement, please comment below.