
Keita Onabuta

Posted on Mar 30, 2020 • Edited on Apr 5, 2020

Japanese Text Classification with BERT (Azure)

#azure #bert #machinelearning #pytorch

This article was originally written in Japanese.

konabuta / AzureML-NLP

NLP for Japanese-language text.

AzureML-NLP

This repository provides sample code for building Japanese natural language processing (NLP) models with Azure Machine Learning. It is based on Microsoft's NLP Best Practices.

Contents

Scenario	Model	Description	Supported language
Text classification	BERT	Supervised learning that trains on labeled text and predicts its category.	Japanese

Get started

We recommend evaluating Azure Cognitive Services first. If its pretrained models cannot handle your use case, you will need to build a custom machine learning model. Start with Setup and install the required libraries.


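I created sample code for Japanese text classification based on "NLP Best Practices", the collection of natural language processing best practices published by Microsoft.

The main differences from the upstream repository are as follows.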

Uses a Japanese BERT tokenizer
- Adds steps for downloading and installing MeCab (plus a dictionary)
Uses a Japanese pretrained model
- Uses a model from Hugging Face
Uses the Livedoor news corpus as sample data

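Because installing the MeCab dictionary is complicated, I have not yet decided whether to merge this into the upstream repository.

The code (excerpts only) is shown below.

1. Preparing the Livedoor corpus data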

    # Download the Livedoor news corpus.
    from urllib.request import urlretrieve
    import tarfile

    text_url = "https://www.rondhuit.com/download/ldcc-20140209.tar.gz"
    file_path = "./ldcc-20140209.tar.gz"
    urlretrieve(text_url, file_path)

    # Extract the gzipped tar archive. The with block closes it automatically,
    # so no explicit tar.close() is needed.
    with tarfile.open(file_path, 'r:gz') as tar:
        tar.extractall(path='livedoor')

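    # Build a pandas DataFrame from the extracted files.
    import os
    import pandas as pd

    # Assumption: the archive extracts into a "text" directory under "livedoor".
    path = os.path.join('livedoor', 'text')

    rows = []
    for folder_name in os.listdir(path):
        print(folder_name)
        # Skip top-level text files; only category folders hold articles.
        if folder_name.endswith(".txt"):
            continue
        for file in os.listdir(os.path.join(path, folder_name)):
            # Skip the license file inside each category folder.
            if file == "LICENSE.txt":
                continue
            with open(os.path.join(path, folder_name, file), 'r', encoding='utf-8') as f:
                lines = f.read().split('\n')
            if len(lines) == 1:
                continue
            # Corpus layout: line 1 is the URL, line 2 the date, line 3 the title,
            # and the remaining lines are the body.
            rows.append({
                'url': lines[0],
                'date': lines[1],
                'label': folder_name,
                'title': lines[2],
                'text': "".join(lines[3:]),
            })

    # Build the frame once; DataFrame.append was removed in pandas 2.0.
    df = pd.DataFrame(rows)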
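The per-file parsing step can be pulled into a small function, which makes it easy to check against a sample string before running it over the whole corpus. The sketch below is illustrative only: `parse_article` is a hypothetical helper, the sample text is made up, and it assumes each article file lists the URL, date, and title on its first three lines, followed by the body.

```python
from typing import Dict, Optional


def parse_article(raw_text: str, label: str) -> Optional[Dict[str, str]]:
    """Split one article file into fields; return None if it is too short."""
    lines = raw_text.split("\n")
    # Assumed layout: URL, date, title, then one or more body lines.
    if len(lines) < 4:
        return None
    return {
        "url": lines[0],
        "date": lines[1],
        "label": label,
        "title": lines[2],
        "text": "".join(lines[3:]),
    }


# Made-up sample in the assumed layout (not real corpus data).
sample = (
    "http://example.com/article/1\n"
    "2020-03-30T09:00:00+0900\n"
    "Sample title\n"
    "First body line.\n"
    "Second body line.\n"
)
record = parse_article(sample, label="it-life-hack")
print(record["title"])
print(record["text"])
```

Returning `None` for short files keeps the caller's loop simple: it can skip those entries without a separate length check.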

2. Fine-tuning

This step uses utils_nlp, the utility package that ships with NLP Best Practices.

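    # Import paths are assumed to match the layout of Microsoft's NLP Best Practices
    # repository. model_name, num_labels, CACHE_DIR, train_dataloader, NUM_EPOCHS,
    # and NUM_GPUS are defined elsewhere in the full sample (not shown in this excerpt).
    from utils_nlp.common.timer import Timer
    from utils_nlp.models.transformers.sequence_classification import SequenceClassifier

    classifier = SequenceClassifier(
        model_name=model_name, num_labels=num_labels, cache_dir=CACHE_DIR
    )

    # Fine-tune the model and measure the elapsed time.
    with Timer() as t:
        classifier.fit(
            train_dataloader,
            num_epochs=NUM_EPOCHS,
            num_gpus=NUM_GPUS,
            verbose=False,
        )

    # t.interval is divided by 3600 to express the training time in hours.
    train_time = t.interval / 3600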
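To make the timing pattern concrete, here is a minimal stand-in for a timing context manager. It is a sketch, not the utils_nlp implementation: the class body is an assumption, and only the `interval` attribute and the division by 3600 come from the excerpt above.

```python
import time


class Timer:
    """Hypothetical minimal timer; not the utils_nlp class."""

    def __enter__(self):
        self._start = time.perf_counter()
        self.interval = None
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Elapsed wall-clock time in seconds.
        self.interval = time.perf_counter() - self._start
        return False  # do not swallow exceptions raised inside the block


with Timer() as t:
    time.sleep(0.01)  # stands in for classifier.fit(...)

# Convert seconds to hours, as the fine-tuning excerpt does.
train_time_hours = t.interval / 3600
print(f"{t.interval:.3f} s = {train_time_hours:.6f} h")
```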

Next, check the accuracy on the test data.

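    # test_dataloader, df_test, LABEL_COL, and label_encoder are defined elsewhere
    # in the full sample (not shown in this excerpt).
    from sklearn.metrics import accuracy_score, classification_report

    # Predict on the test data.
    preds = classifier.predict(test_dataloader, num_gpus=NUM_GPUS, verbose=False)

    # Evaluate the predictions.
    accuracy = accuracy_score(df_test[LABEL_COL], preds)
    class_report = classification_report(
        df_test[LABEL_COL],
        preds,
        target_names=label_encoder.classes_,
        output_dict=True,
    )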
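The excerpt does not show how label_encoder is created. As a rough illustration of what such an encoder does, the sketch below maps category names to integer ids and back using only the standard library. It assumes classes are ordered alphabetically, which is how scikit-learn's LabelEncoder orders `classes_`; the variable names are hypothetical.

```python
# Category names as they might appear in the label column.
labels = ["sports-watch", "it-life-hack", "sports-watch", "movie-enter"]

# Sorted unique labels play the role of label_encoder.classes_.
classes = sorted(set(labels))
label_to_id = {name: i for i, name in enumerate(classes)}

# Encode names to integer ids for training, then decode ids back to names.
encoded = [label_to_id[name] for name in labels]
decoded = [classes[i] for i in encoded]

print(classes)
print(encoded)
```

Because the ids index into `classes`, passing that list as `target_names` lines up each row of a classification report with the right category.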

The final accuracy came out at about 87%, with an F1 score of about 0.86.

accuracy : 0.866052
f1-score : 0.858849
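For readers who want to see what these two numbers measure, the sketch below computes accuracy and a macro-averaged F1 score by hand on a tiny made-up example. The post does not say which averaging produced the reported f1-score, so macro averaging here is an assumption, and the function names are hypothetical.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for three classes (made-up data, not the Livedoor results).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

print(f"accuracy : {accuracy(y_true, y_pred):.6f}")
print(f"f1-score : {macro_f1(y_true, y_pred):.6f}")
```

Accuracy counts every example equally, while macro F1 gives each category equal weight, so the two can diverge when some categories are harder than others.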