Named Entity Recognition (NER) is one of the fundamental building blocks of natural language understanding. When humans read text, we naturally identify and categorize named entities based on context and world knowledge. For instance, in the sentence “Microsoft’s CEO Satya Nadella spoke at a conference in Seattle,” we effortlessly recognize the organizational, personal, and geographical references. However, teaching machines to replicate this seemingly intuitive human capability presents several challenges. Fortunately, this problem can be addressed effectively using a pretrained machine learning model.
In this post, you will learn how to solve the NER problem with a BERT model using just a few lines of Python code.
Let’s get started.
How to Do Named Entity Recognition (NER) with a BERT Model
Picture by Jon Tyson. Some rights reserved.
Overview
This post is in six parts; they are:
- The Complexity of NER Systems
- The Evolution of NER Technology
- BERT’s Revolutionary Approach to NER
- Using BERT with Hugging Face's Pipeline
- Using BERT Explicitly with AutoModelForTokenClassification
- Best Practices for NER Implementation
The Complexity of NER Systems
The challenge of Named Entity Recognition extends far beyond simple pattern matching or dictionary lookups. Several key factors contribute to its complexity.
One of the most significant challenges is context dependency—understanding how words change meaning based on surrounding text. The same word can represent different entity types depending on its context. Consider these examples:
- “Apple announced new products.” (Apple is an organization.)
- “I ate an apple for lunch.” (Apple is a common noun, not a named entity.)
- “Apple Street is closed.” (Apple is a location.)
Named entities often consist of multiple words, making boundary detection another challenge. Entity names can be complex, such as:
- Corporate entities: “Bank of America Corporation”
- Product names: “iPhone 14 Pro Max”
- Person names: “Martin Luther King Jr.”
Additionally, language is dynamic and continuously evolving. Instead of memorizing what qualifies as an entity, models must deduce it from context. Language evolution introduces new entities, such as emerging companies, new products, and newly coined terms.
Now, let’s explore how state-of-the-art NER models address these challenges.
The Evolution of NER Technology
The evolution of NER technology reflects the broader advancement of natural language processing. Early approaches relied on rule-based systems and pattern matching—defining grammatical patterns, identifying capitalization, and using contextual markers (e.g., “the” before a proper noun). However, these rules were often numerous, inconsistent, and difficult to scale.
To improve accuracy, researchers introduced statistical approaches, leveraging probability-based models such as Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) to identify named entities.
With the rise of deep learning, neural networks became the preferred method for NER. Initially, bidirectional LSTM networks showed promise. However, the introduction of attention mechanisms and transformer-based models proved to be even more effective.
BERT’s Revolutionary Approach to NER
BERT (Bidirectional Encoder Representations from Transformers) has fundamentally transformed NER with several key innovations:
Contextual Understanding
Unlike traditional models that process text in one direction, BERT’s bidirectional nature allows it to consider both preceding and following text. This enables it to capture long-range dependencies, understand subtle contextual nuances, and handle ambiguous cases more effectively.
Tokenization and Subword Units
While not exclusive to BERT, its subword tokenization strategy allows it to handle unknown words while preserving morphological information. This reduces vocabulary size and makes the model adaptable across different languages and domains.
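As a quick illustration (a minimal sketch, not part of the original post), you can load the tokenizer of the checkpoint used later in this tutorial and see how it splits words into subword pieces:

from transformers import AutoTokenizer

# Tokenizer of the CoNLL-03 fine-tuned BERT model used in the examples below
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")

# Words outside the vocabulary are split into pieces prefixed with "##";
# the exact split depends on the model's vocabulary
print(tokenizer.tokenize("Satya Nadella spoke in Hyderabad."))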
The IOB Tagging Mechanism
NER results can be represented in various ways, but BERT-based NER models typically use the Inside-Outside-Beginning (IOB) tagging scheme:
- B marks the beginning of an entity.
- I indicates the continuation of an entity.
- O signifies non-entities.
This scheme lets BERT delineate multi-word entities precisely, token by token.
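As a hand-constructed illustration (not actual model output), the sentence "Tim Cook works at Apple Inc." would be tagged like this:

# Hand-written IOB tags for illustration only, not real model output
tagged = [("Tim", "B-PER"), ("Cook", "I-PER"), ("works", "O"),
          ("at", "O"), ("Apple", "B-ORG"), ("Inc.", "I-ORG")]
for token, tag in tagged:
    print(f"{token}\t{tag}")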
Using BERT with Hugging Face's Pipeline
The easiest way to perform NER is by using Hugging Face's pipeline API, which abstracts away much of the complexity while still delivering powerful results. Here's an example:
from transformers import pipeline

# Initialize the NER pipeline
ner_pipeline = pipeline("ner",
                        model="dbmdz/bert-large-cased-finetuned-conll03-english",
                        aggregation_strategy="simple")

# Text example
text = "Apple CEO Tim Cook announced new iPhone models in California yesterday."

# Perform NER
entities = ner_pipeline(text)

# Print the results
for entity in entities:
    print(f"Entity: {entity['word']}")
    print(f"Type: {entity['entity_group']}")
    print(f"Confidence: {entity['score']:.4f}")
    print("-" * 30)
Let's break this code down in detail. The first step is to initialize the pipeline:
ner_pipeline = pipeline("ner",
                        model="dbmdz/bert-large-cased-finetuned-conll03-english",
                        aggregation_strategy="simple")
The pipeline() function creates a ready-to-use NER pipeline. This is needed because BERT is a machine learning model: the text must be preprocessed before the model can consume it, and the model's output must be converted into a usable data structure. A pipeline connects these steps.
The argument "ner" specifies that you want Named Entity Recognition, and model="dbmdz/bert-large-cased-finetuned-conll03-english" loads a pre-trained model specifically fine-tuned for NER. The final argument, aggregation_strategy="simple", tells the pipeline to merge subwords into complete words, which makes the output more readable.
The pipeline above returns a list of dictionaries, where each dictionary contains:
- word: The detected entity text
- entity_group: The type of entity (PER for person, ORG for organization, etc.)
- score: Confidence score between 0 and 1
- start and end: Character positions in the original text
This code will output something like:
Entity: Apple
Type: ORG
Confidence: 0.9987
------------------------------
Entity: Tim Cook
Type: PER
Confidence: 0.9956
------------------------------
Entity: California
Type: LOC
Confidence: 0.9934
------------------------------
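For comparison, here is a small sketch (not part of the original example) that runs the same model with aggregation_strategy="none", so that each subword token is reported separately with its raw IOB label:

from transformers import pipeline

# Same model, but without merging subwords into whole words
raw_pipeline = pipeline("ner",
                        model="dbmdz/bert-large-cased-finetuned-conll03-english",
                        aggregation_strategy="none")

# Each item now carries an "entity" key (e.g., B-ORG) instead of "entity_group"
for item in raw_pipeline("Tim Cook works at Apple."):
    print(item["word"], item["entity"], f"{item['score']:.4f}")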
Using BERT Explicitly with AutoModelForTokenClassification
For greater control over NER, you can bypass the pipeline API and interact directly with the model and tokenizer:
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Text example
text = "Google and Microsoft are competing in the AI space while Elon Musk founded SpaceX."

# Tokenize the text
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=True)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)

# Convert predictions to labels
label_list = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predictions = predictions[0].tolist()

# Process results
current_entity = []
current_entity_type = None

for token, prediction in zip(tokens, predictions):
    if token.startswith("##"):
        # Subword piece: append it to the entity currently being built
        if current_entity:
            current_entity.append(token[2:])
    else:
        # New word: flush any completed entity first
        if current_entity:
            print(f"Entity: {''.join(current_entity)}")
            print(f"Type: {current_entity_type}")
            print("-" * 30)
            current_entity = []

        if label_list[prediction] != "O":
            current_entity = [token]
            current_entity_type = label_list[prediction]

# Print final entity if exists
if current_entity:
    print(f"Entity: {''.join(current_entity)}")
    print(f"Type: {current_entity_type}")
This implementation is longer. Let's see how it works, step by step. The first step is to load the model and tokenizer:
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
AutoTokenizer automatically selects the appropriate tokenizer for the model based on its model card; a model is usually paired with one specific tokenizer. A tokenizer is an algorithm that transforms and splits the input text string into tokens. AutoModelForTokenClassification loads a model suited to token classification tasks, including both the architecture and the pretrained weights for NER.
Then, let the tokenizer process the input text:
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=True)
This converts the text into tokens that the model can understand. A token is usually a word but can also be a subword: for example, "sub-" and "-word" may be recognized separately even though they appear as one word in the text. The output of the tokenizer is a sequence of integers, where each integer corresponds to a token in the tokenizer's vocabulary. The argument return_tensors="pt" returns the sequence as PyTorch tensors, and add_special_tokens=True adds the [CLS] and [SEP] tokens to the beginning and end of the output, as required by BERT.
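To make this concrete, here is a short sketch (not part of the original code) that prints the tokens and token IDs produced for a sample sentence; expect [CLS] and [SEP] at the ends, and "##"-prefixed pieces for words outside the vocabulary:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
inputs = tokenizer("Elon Musk founded SpaceX.", return_tensors="pt", add_special_tokens=True)

# Map the integer IDs back to readable tokens and print them side by side
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token_id, token in zip(inputs["input_ids"][0].tolist(), tokens):
    print(token_id, token)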
Next, run the model on the input tensors:
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)
The context manager torch.no_grad() disables gradient calculation during inference, which saves time and memory. Calling model(**inputs) runs the forward pass, and torch.argmax(outputs.logits, dim=2) picks the most likely label index for each token. The result, predictions, is a tensor of integers.
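If you want to check the shapes involved (a quick inspection that continues the snippet above, not part of the original code), the logits carry one score per label for every token, while predictions keeps only the winning label index:

# Continues the example above: inspect the tensor shapes
print(outputs.logits.shape)   # (batch_size, sequence_length, num_labels)
print(predictions.shape)      # (batch_size, sequence_length)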
To read the output, we need to convert the integers back into labels. First, let's prepare the data structures for the conversion:
label_list = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predictions = predictions[0].tolist()
The dictionary model.config.id2label maps prediction indices to actual entity labels. The function convert_ids_to_tokens converts integer token IDs back into readable tokens. Since the model was run on a single line of input text, only one output sequence is expected, so we take the first element and convert the predictions to a Python list for easier processing.
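To see what the label set looks like, you can print the mapping (a quick check that continues the example above; for a model fine-tuned on CoNLL-03 you would expect labels such as O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, B-MISC, and I-MISC):

# Continues the example above: show the index-to-label mapping from the model config
for idx, label in model.config.id2label.items():
    print(idx, label)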
The reconstruction of the entity predictions uses a for-loop. In BERT's tokenizer, a subword is prefixed with "##", so you can easily identify subwords and merge them back into whole words. The type of the current entity is determined from the prediction and presented as a label using the dictionary label_list. This presents the result in a human-readable format.
Best Practices for NER Implementation
Doing NER is as simple as shown above. However, you are not required to use exactly this code. In particular, you can switch between different models (and their corresponding tokenizers). If you need the model to run fast, pick a DistilBERT model; if you need the most accurate results, choose a larger BERT or RoBERTa model. You may also look for a domain-adapted model if your input requires specialized domain knowledge.
Moreover, if you need to run NER over a large amount of input, you can do it faster by processing the texts in batches, as sketched below. Other techniques can speed up the process further, such as running the model on a GPU or caching the results for frequently accessed texts.
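Here is a minimal sketch (not part of the original post) of batched processing with the pipeline API, reusing the same checkpoint as earlier; the batch_size value is an arbitrary choice you should tune for your hardware:

from transformers import pipeline

# Same checkpoint as earlier; pass device=0 to run on the first GPU if available
ner_pipeline = pipeline("ner",
                        model="dbmdz/bert-large-cased-finetuned-conll03-english",
                        aggregation_strategy="simple")

texts = [
    "Tim Cook visited Berlin last week.",
    "Amazon opened a new office in Toronto.",
    "Serena Williams spoke at Stanford University.",
]

# Passing a list of texts lets the pipeline batch them internally
for text, entities in zip(texts, ner_pipeline(texts, batch_size=8)):
    print(text)
    for entity in entities:
        print(f"  {entity['word']} ({entity['entity_group']}, {entity['score']:.4f})")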
In a production system, some error-handling logic should be implemented as well, such as validating the input and handling edge cases like empty strings and special characters.
Here’s a complete example incorporating these best practices:
from transformers import pipeline
import torch
import logging
from typing import List, Dict

class NERProcessor:
    def __init__(self,
                 model_name: str = "dbmdz/bert-large-cased-finetuned-conll03-english",
                 confidence_threshold: float = 0.8):
        self.confidence_threshold = confidence_threshold
        try:
            self.device = "cuda" if torch.cuda.is_available() else "cpu"
            self.ner_pipeline = pipeline("ner",
                                         model=model_name,
                                         aggregation_strategy="simple",
                                         device=self.device)
        except Exception as e:
            logging.error(f"Failed to initialize NER pipeline: {str(e)}")
            raise

    def process_text(self, text: str) -> List[Dict]:
        if not text or not isinstance(text, str):
            logging.warning("Invalid input text")
            return []

        try:
            # Get predictions
            entities = self.ner_pipeline(text)

            # Post-process results
            filtered_entities = [
                entity for entity in entities
                if entity["score"] >= self.confidence_threshold
            ]

            return filtered_entities
        except Exception as e:
            logging.error(f"Error processing text: {str(e)}")
            return []

if __name__ == "__main__":
    # Initialize processor
    processor = NERProcessor()

    # Text example
    text = """
    Apple Inc. CEO Tim Cook announced new partnerships with Microsoft
    and Google during a conference in New York City. The event was also
    attended by Sundar Pichai and Satya Nadella.
    """

    # Process text
    results = processor.process_text(text)

    # Print results
    for entity in results:
        print(f"Entity: {entity['word']}")
        print(f"Type: {entity['entity_group']}")
        print(f"Confidence: {entity['score']:.4f}")
        print("-" * 30)
Summary
Named Entity Recognition with BERT models provides a powerful way to extract structured information from text. The Hugging Face Transformers library makes it easy to implement NER with state-of-the-art models, whether you need a simple pipeline approach or more detailed control over the process.
In this tutorial, you learned about NER with BERT. In particular, you learned how to:
- Use the pipeline API for quick prototypes and simple applications
- Use explicit model handling for more control and custom processing
- Consider performance optimization for production applications
- Handle edge cases and implement proper error handling
With these tools and techniques, you can build robust NER systems for various applications, from information extraction to document processing and more.