Naresh Nishad

Day 37: Named Entity Recognition (NER) with LLMs

Introduction

Named Entity Recognition (NER) is a core Natural Language Processing (NLP) task that identifies and classifies entities such as names, locations, organizations, and dates in text. For example, in the sentence "Apple opened a new office in Berlin in March 2024", an NER system tags Apple as an organization, Berlin as a location, and March 2024 as a date. With Large Language Models (LLMs), NER has become substantially more accurate and adaptable.

Why LLMs for NER?

  • Contextual Understanding: LLMs, like BERT and GPT, excel at capturing the surrounding context, enabling accurate entity recognition even in ambiguous scenarios.
  • Transfer Learning: Pretrained models can be fine-tuned on domain-specific datasets for high-performance NER tasks.
  • Scalability: Pretrained LLMs can be adapted to new languages or domains with relatively little additional labeled data.

Steps to Implement NER

1. Data Preparation

  • Collect labeled NER datasets (e.g., CoNLL-2003, OntoNotes).
  • Preprocess data to match the input requirements of the chosen LLM.
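
Before touching a model, it is worth inspecting what the dataset actually looks like. Here is a minimal sketch, assuming the Hugging Face datasets library and the same conll2003 dataset used in the fine-tuning example further down:

from datasets import load_dataset

# Load the CoNLL-2003 benchmark (the same dataset used in the fine-tuning example below)
dataset = load_dataset("conll2003")

# Each example carries pre-split tokens and integer NER tags
print(dataset["train"][0]["tokens"])
print(dataset["train"][0]["ner_tags"])

# The tag vocabulary follows the BIO scheme (O, B-PER, I-PER, B-ORG, ...)
label_list = dataset["train"].features["ner_tags"].feature.names
print(label_list)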

2. Model Selection

  • Popular models: BERT, DistilBERT, RoBERTa, GPT, spaCy transformer-based pipelines.
  • Use the Hugging Face transformers library for quick implementation.
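
To get a feel for model output before fine-tuning anything, the transformers pipeline API can run any token-classification checkpoint off the shelf. A minimal sketch, assuming the publicly available dslim/bert-base-NER checkpoint (any other NER checkpoint can be swapped in):

from transformers import pipeline

# "ner" is an alias for the token-classification pipeline;
# aggregation_strategy="simple" merges sub-word pieces into whole entity mentions
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

print(ner("Hugging Face is based in New York City."))
# -> a list of dicts with entity_group, word, score, start and end offsets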

3. Fine-tuning

  • Fine-tune the model using frameworks like PyTorch or TensorFlow on labeled data.

4. Evaluation

  • Use metrics like F1-score, precision, and recall to evaluate performance.
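
For NER these metrics are normally computed over complete entity spans rather than individual tokens, which is what the seqeval metric does. A minimal sketch with made-up label sequences, assuming the evaluate and seqeval packages are installed:

from evaluate import load

metric = load("seqeval")

# Toy gold and predicted tag sequences (one sentence each)
references = [["B-PER", "I-PER", "O", "B-LOC"]]
predictions = [["B-PER", "I-PER", "O", "O"]]

# seqeval scores whole spans: the missed B-LOC entity lowers recall
results = metric.compute(predictions=predictions, references=references)
print(results["overall_precision"], results["overall_recall"], results["overall_f1"])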

Example: Fine-Tuning BERT for NER

Here's an example using Hugging Face's transformers library:

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset
from evaluate import load

# Load dataset
dataset = load_dataset("conll2003")

# Load tokenizer and model
model_name = "bert-base-uncased"  # a cased model (e.g. bert-base-cased) often works better for NER
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Get number of labels
label_list = dataset["train"].features["ner_tags"].feature.names
num_labels = len(label_list)

# Load model with proper configuration
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label={i: label for i, label in enumerate(label_list)},
    label2id={label: i for i, label in enumerate(label_list)},
)

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
        max_length=512,
    )
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_id = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                # Special tokens get the ignore index
                label_ids.append(-100)
            elif word_id != previous_word_id:
                # Only the first sub-word token of each word keeps the label
                label_ids.append(label[word_id])
            else:
                # Remaining sub-word tokens are ignored in the loss
                label_ids.append(-100)
            previous_word_id = word_id
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

# Process datasets
tokenized_datasets = dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset["train"].column_names,
)

# Pads inputs and labels dynamically per batch (labels padded with -100)
data_collator = DataCollatorForTokenClassification(tokenizer)

# Load metric
metric = load("seqeval")

def compute_metrics(eval_preds):
    predictions, labels = eval_preds
    predictions = predictions.argmax(axis=-1)
    # Remove ignored index (special tokens and non-initial sub-words)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    save_total_limit=2,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    report_to="none",  # Disable wandb logging
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Check if CUDA is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Train model
trainer.train()

# Evaluate model
results = trainer.evaluate()
print("\nEvaluation Results:")
print(results)

# Save the model
trainer.save_model("./final_model")
print("\nModel saved to ./final_model")
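After training, the saved checkpoint can be loaded straight into a pipeline for inference. A minimal sketch, assuming the tokenizer files ended up in ./final_model alongside the weights (the Trainer normally writes them when a tokenizer is passed; otherwise call tokenizer.save_pretrained("./final_model") first):

from transformers import pipeline

# Load the fine-tuned model and tokenizer from the directory saved above
ner = pipeline(
    "ner",
    model="./final_model",
    tokenizer="./final_model",
    aggregation_strategy="simple",
)

print(ner("Angela Merkel visited the Microsoft campus in Redmond."))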

Applications of NER

  • Healthcare: Extracting medical entities from patient records.
  • Finance: Identifying company names, financial events, and monetary amounts.
  • Customer Service: Enhancing chatbots by extracting entities such as product names, order numbers, and dates from user messages.
  • Content Curation: Automating tagging of articles and media.

Challenges and Tips

  • Ambiguity: Use larger models or ensemble techniques for disambiguation.
  • Domain-Specific Entities: Fine-tune on domain-specific datasets.
  • Evaluation: Ensure balanced datasets for unbiased evaluation metrics.

Conclusion

NER powered by LLMs has transformed information extraction, making it faster and more reliable across industries. By leveraging LLMs, we can unlock insights from unstructured text with ease.
