CodeSwitch is an NLP tool for language identification, POS tagging, named entity recognition, and sentiment analysis of code-mixed data.
We trained multilingual BERT models on the LinCE dataset using Hugging Face transformers. LinCE provides code-mixed data for four language pairs; we used three of them: Spanish-English, Hindi-English, and Nepali-English. We hope to train and add more languages and tasks in the future.
- Spanish-English (spa-eng)
- Hindi-English (hin-eng)
- Nepali-English (nep-eng)
Language codes: spa-eng for Spanish-English, hin-eng for Hindi-English, nep-eng for Nepali-English.
Install with pip:

    pip install codeswitch

Dependency:
- pytorch >= 1.6.0
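If you are unsure whether an environment meets the PyTorch requirement, a small check along these lines may help. It is only a sketch: the helper names `version_tuple` and `torch_is_new_enough` are made up for this example and are not part of codeswitch, and the parsing assumes version strings shaped like `1.6.0` or `2.1.0+cu118`.

```python
import re

# Minimum PyTorch version listed as a dependency above.
MIN_TORCH = (1, 6, 0)


def version_tuple(version: str) -> tuple:
    """Turn a version string into a (major, minor, patch) tuple of ints."""
    # Drop a local build suffix such as "+cu118" before parsing.
    core = version.split("+")[0]
    numbers = []
    for piece in core.split(".")[:3]:
        # Keep only the leading digits, so "0a0" or "0rc1" count as 0.
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    # Pad short versions such as "2.1" to three components.
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def torch_is_new_enough(version: str) -> bool:
    """Return True if the given version string is at least MIN_TORCH."""
    return version_tuple(version) >= MIN_TORCH
```

With PyTorch installed, `torch_is_new_enough(torch.__version__)` should report whether the requirement is met.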
- All three sequence tagging models (lid, ner, pos) were trained with Hugging Face token classification.
- The sentiment analysis model was trained with Hugging Face text classification.
- You can find every model and its evaluation results here
- Language Identification
  - Spanish-English
  - Hindi-English
  - Nepali-English
- POS
  - Spanish-English
  - Hindi-English
- NER
  - Spanish-English
  - Hindi-English
- Sentiment Analysis
  - Spanish-English
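The task and language coverage listed above can be written down as a small lookup table, which is handy for validating a language code before loading a model. This is a sketch, not part of the codeswitch package: `SUPPORTED`, `is_supported`, and `tasks_for` are hypothetical names, and the short task key `sa` for sentiment analysis is an assumption made for this example.

```python
# Task -> language codes, copied from the support list above.
# The keys "lid", "pos", "ner" follow the abbreviations used in this README;
# "sa" is a made-up key for sentiment analysis.
SUPPORTED = {
    "lid": {"spa-eng", "hin-eng", "nep-eng"},
    "pos": {"spa-eng", "hin-eng"},
    "ner": {"spa-eng", "hin-eng"},
    "sa": {"spa-eng"},
}


def is_supported(task: str, lang: str) -> bool:
    """Return True if the task is listed as available for the language code."""
    return lang in SUPPORTED.get(task, set())


def tasks_for(lang: str) -> list:
    """Return the sorted task keys listed for a language code."""
    return sorted(task for task, langs in SUPPORTED.items() if lang in langs)


print(tasks_for("nep-eng"))  # ['lid']
```

Keeping the table in one place means a new language or task only needs one extra entry.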
Language identification:

    from codeswitch.codeswitch import LanguageIdentification

    # Use 'hin-eng' for Hindi-English or 'nep-eng' for Nepali-English
    lid = LanguageIdentification('spa-eng')
    text = ""  # your code-mixed sentence
    result = lid.identify(text)
    print(result)

POS tagging:

    from codeswitch.codeswitch import POS

    # Use 'hin-eng' for Hindi-English
    pos = POS('spa-eng')
    text = ""  # your code-mixed sentence
    result = pos.tag(text)
    print(result)

Named entity recognition:

    from codeswitch.codeswitch import NER

    # Use 'hin-eng' for Hindi-English
    ner = NER('spa-eng')
    text = ""  # your code-mixed sentence
    result = ner.tag(text)
    print(result)

Sentiment analysis:

    from codeswitch.codeswitch import SentimentAnalysis

    sa = SentimentAnalysis('spa-eng')
    sentence = "El perro le ladraba a La Gatita .. .. lol #teamlagatita en las playas de Key Biscayne este Memorial day"
    result = sa.analyze(sentence)
    print(result)
    # [{'label': 'LABEL_1', 'score': 0.9587041735649109}]
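The sentiment result shown above is a list of dictionaries with `label` and `score` keys. A small post-processing helper, sketched below, picks the highest-scoring entry and optionally renames the raw label. `top_prediction` and the `label_names` argument are hypothetical and not part of codeswitch; this README does not say which sentiment each `LABEL_n` stands for, so any mapping you pass in is your own assumption and should be checked against the model's documentation.

```python
from typing import Dict, List, Optional


def top_prediction(result: List[Dict], label_names: Optional[Dict[str, str]] = None) -> Dict:
    """Return the highest-scoring entry, optionally with a renamed label."""
    if not result:
        raise ValueError("result is empty")
    # Pick the entry with the largest score.
    best = max(result, key=lambda item: item["score"])
    label = best["label"]
    # Fall back to the raw label when no mapping (or no match) is given.
    name = label_names.get(label, label) if label_names else label
    return {"label": name, "score": best["score"]}


# Output copied from the sentiment analysis example above.
result = [{'label': 'LABEL_1', 'score': 0.9587041735649109}]
print(top_prediction(result))
# {'label': 'LABEL_1', 'score': 0.9587041735649109}
```

Leaving the mapping as a caller-supplied argument avoids hard-coding label meanings that may differ between models.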