In Natural Language Processing, identifying the words in an English sentence (or any language written in Latin characters) is not too hard, because words are separated by spaces. A script like Khmer is different: words are written without spaces, so we need to compare substrings of the text against existing words in a dictionary.
Dictionary Format:
You can structure your dictionary to include related words and explanatory phrases. Here's an example format:
Example:
khmer_dictionary = {
    'មាន': {'POS': 'Verb', 'Related': ['មានសៀវភៅ', 'មានទិន្នន័យ'], 'Explanation': 'to have'},
    'សៀវភៅ': {'POS': 'Noun', 'Related': [], 'Explanation': 'book'},
    'ច្រើន': {'POS': 'Adjective', 'Related': [], 'Explanation': 'many'},
    'ណាស់': {'POS': 'Adverb', 'Related': [], 'Explanation': 'very'},
    'នៅ': {'POS': 'Verb', 'Related': [], 'Explanation': 'to be at'},
    'ទីនេះ': {'POS': 'Noun', 'Related': [], 'Explanation': 'this place'}
}
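With this structure, looking up a word gives you its part of speech and gloss directly, for example:

entry = khmer_dictionary['មាន']
print(entry['POS'])          # Verb
print(entry['Explanation'])  # to have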
Improving the Tokenization Method:
To handle multi-word phrases and out-of-vocabulary (OOV) words better, you need to adjust your tokenization function. Here's a revised version that scans for the longest dictionary match at each position:
def tokenize_with_dictionary(sentence):
    tokens = []
    i = 0
    while i < len(sentence):
        # Greedy longest-match: try the longest dictionary entry starting
        # at position i, so longer words are not split into shorter ones.
        match = None
        for j in range(len(sentence), i, -1):
            if sentence[i:j] in khmer_dictionary:
                match = sentence[i:j]
                break
        if match:
            tokens.append((match, khmer_dictionary[match]))
            i += len(match)
        else:
            # No dictionary entry starts here: emit the character as
            # out-of-vocabulary (OOV) and move on.
            tokens.append((sentence[i], 'OOV'))
            i += 1
    return tokens
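As a quick check, here is how the function could be exercised on a sample sentence assembled from the dictionary entries above (roughly, 'there are many books here'):

sentence = 'មានសៀវភៅច្រើនណាស់នៅទីនេះ'
for word, entry in tokenize_with_dictionary(sentence):
    gloss = entry if entry == 'OOV' else entry['Explanation']
    print(word, '->', gloss)
# មាន -> to have
# សៀវភៅ -> book
# ច្រើន -> many
# ណាស់ -> very
# នៅ -> to be at
# ទីនេះ -> this place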
You can then save the resulting tokens to a database.
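As a minimal sketch, assuming SQLite (the khmer_tokens.db filename and the tokens table schema here are hypothetical; adapt them to your own database):

import sqlite3

# Hypothetical schema: one row per token with its POS tag and gloss.
conn = sqlite3.connect('khmer_tokens.db')
conn.execute('CREATE TABLE IF NOT EXISTS tokens (word TEXT, pos TEXT, explanation TEXT)')

sentence = 'មានសៀវភៅច្រើនណាស់នៅទីនេះ'
for word, entry in tokenize_with_dictionary(sentence):
    pos = 'OOV' if entry == 'OOV' else entry['POS']
    explanation = None if entry == 'OOV' else entry['Explanation']
    conn.execute('INSERT INTO tokens VALUES (?, ?, ?)', (word, pos, explanation))

conn.commit()
conn.close()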
If you have a better idea or any suggestions for improvement, please comment below.