Question Answering (Q&A) is one of the signature practical applications of natural language processing. In a previous post, you saw how to use DistilBERT for question answering by building a pipeline with the transformers library. In this post, you will dive deeper into the technical details and see how you can manipulate the model's output for your own purposes. Specifically, you will learn:
- How to use the model to answer questions from a context
- How to interpret the model’s output
- How to build your own question-answering algorithm by leveraging the model’s output
Let’s get started.
Advanced Q&A Features with DistilBERT
Photo by Marcin Nowak. Some rights reserved.
Overview
This post is divided into three parts; they are:
- Using DistilBERT Model for Question Answering
- Evaluating the Answer
- Other Techniques for Improving the Q&A Capability
Using DistilBERT Model for Question Answering
BERT (Bidirectional Encoder Representations from Transformers) was trained to be a general-purpose language model that can understand text. DistilBERT is a distilled version of BERT, meaning it is architecturally similar but smaller: it is 40% smaller in size and runs 60% faster, while retaining 97% of BERT's language understanding capabilities. This makes it a good model for production use where higher throughput matters.
A pre-trained DistilBERT model is available on the Hugging Face model hub, and it must be used with its matching tokenizer. In the transformers library, the DistilBERT tokenizer and the Q&A model are `DistilBertTokenizer` and `DistilBertForQuestionAnswering`, respectively. You can load the pre-trained model and use it with the code below:
```python
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
import torch

# Load pre-trained model and tokenizer
model_name = "distilbert-base-uncased-distilled-squad"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForQuestionAnswering.from_pretrained(model_name)

# Define a context and a question
question = "What is machine learning?"
context = """Machine learning is a field of inquiry devoted to understanding and building
methods that "learn", that is, methods that leverage data to improve performance on some
set of tasks. It is seen as a part of artificial intelligence. Machine learning algorithms
build a model based on sample data, known as training data, in order to make predictions
or decisions without being explicitly programmed to do so. Machine learning algorithms are
used in a wide variety of applications, such as in medicine, email filtering, speech
recognition, and computer vision, where it is difficult or unfeasible to develop
conventional algorithms to perform the needed tasks."""

# Tokenize the input and run the model
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Process the answer
answer_start = torch.argmax(outputs.start_logits)
answer_end = torch.argmax(outputs.end_logits)
answer_tokens = inputs.input_ids[0, answer_start:answer_end + 1]
answer = tokenizer.decode(answer_tokens)

print(f"Question: {question}")
print(f"Answer: {answer}")
```
You created the tokenizer and the model using the `from_pretrained()` method, which downloads them from the model hub and instantiates the objects. You defined the question and the context as Python strings. But since the model is a neural network that accepts numerical tensors, you need the tokenizer to convert the strings into integer "tokens" that the model can understand.

You pass both the question and the context to the tokenizer. This is only one of the many ways the tokenizer can be used, but it is what the Q&A model expects. The tokenizer will format the input as:
```
[CLS] question [SEP] context [SEP]
```
Here, `[CLS]` and `[SEP]` are special tokens that mark the beginning of the sequence and the end of each subsequence.
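If you want to see this layout for yourself, you can decode the token IDs back into text. Below is a minimal sketch; the shortened context string is only for illustration:

```python
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-distilled-squad")
question = "What is machine learning?"
context = "Machine learning is a field of inquiry devoted to building methods that learn from data."

# Decoding the token IDs reveals the [CLS] question [SEP] context [SEP] layout
inputs = tokenizer(question, context, return_tensors="pt")
print(tokenizer.decode(inputs.input_ids[0]))
```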
The tokenizer's output is then passed to the model, which returns an output object with the attributes `start_logits` and `end_logits`. These are logits (raw, unnormalized scores) indicating where the answer is located in the context. From them, we can extract the answer span from the sequence of input tokens, convert it back to text, and report it as the answer.
Evaluating the Answer
Recall that the model produces logits for the start and end positions of the answer in the context. In the example above, we extract the answer simply by taking the most probable start and end positions using the `argmax()` function. Note that you should not interpret the floating-point values produced by the model as probabilities: to obtain probabilities, you need to convert the logits with the softmax function.
An ideal model would produce a one-hot distribution, i.e., one position with probability 1 and all the others with probability 0. In practice, this is not the case. Instead, the model produces a distribution with a sharp contrast between one position and the rest when it is confident about the answer, and an almost uniform distribution when it is not.
Therefore, we can further interpret the logit output as the model's confidence score for the answer. Below is an example of why this is useful:
```python
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
import torch
import numpy as np

# Load pre-trained model and tokenizer
model_name = "distilbert-base-uncased-distilled-squad"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForQuestionAnswering.from_pretrained(model_name)

# Define multiple contexts
question = "What is deep learning?"
contexts = [
    """Machine learning is a field of inquiry devoted to understanding and building methods
    that "learn", that is, methods that leverage data to improve performance on some set of
    tasks. It is seen as a part of artificial intelligence.""",
    """Deep learning is a subset of machine learning where artificial neural networks,
    algorithms inspired by the human brain, learn from large amounts of data. Deep learning
    is behind many recent advances in AI, including computer vision and speech recognition.""",
    """Natural Language Processing (NLP) is a field of AI that gives machines the ability to
    read, understand, and derive meaning from human languages. It's used in applications
    like chatbots, translation services, and sentiment analysis."""
]

# Function to get answer from a single context
def get_answer(question, context):
    inputs = tokenizer(question, context, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the most likely answer span
    answer_start = torch.argmax(outputs.start_logits)
    answer_end = torch.argmax(outputs.end_logits)

    # Calculate the confidence score (simplified)
    confidence = float(outputs.start_logits[0, answer_start] + outputs.end_logits[0, answer_end])

    # Extract the answer
    answer_tokens = inputs.input_ids[0, answer_start:answer_end + 1]
    answer = tokenizer.decode(answer_tokens)

    return answer, confidence

# Get answers from all contexts
answers_with_scores = [get_answer(question, context) for context in contexts]

# Find the answer with the highest confidence score
best_answer_idx = np.argmax([score for _, score in answers_with_scores])
best_answer, best_score = answers_with_scores[best_answer_idx]

print(f"Question: {question}")
print(f"Best Answer: {best_answer}")
print(f"From Context: {contexts[best_answer_idx][:100]}...")
print(f"Confidence Score: {best_score}")
```
In the example above, instead of one question and one context, we provided a single question and a list of contexts. Each context is used to produce an answer and a confidence score, implemented in the function `get_answer()`. The score is a simple sum of the model's predicted values for the start and end positions, on the assumption that the model produces larger values when it is more confident about an answer. Finally, you find the answer with the highest confidence score and report it.
This approach allows us to search for answers across multiple documents and return the most confident one. However, it is worth noting that this is a simplified approach. In a production system, you might want more sophisticated methods for ranking answers, such as considering the length of the answer and its position in the document, or using a separate ranking model.
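For instance, below is a minimal sketch of one such heuristic: it scores a candidate span by the product of its softmax start and end probabilities, slightly discounted for longer answers. The function name `answer_probability` and the `length_penalty` value are hypothetical choices, not part of the example above:

```python
import torch

def answer_probability(start_logits, end_logits, start_idx, end_idx, length_penalty=0.05):
    """Hypothetical score: product of the softmax probabilities of the chosen
    start and end positions, discounted slightly for longer spans."""
    start_probs = torch.softmax(start_logits, dim=-1)
    end_probs = torch.softmax(end_logits, dim=-1)
    span_prob = float(start_probs[0, start_idx] * end_probs[0, end_idx])
    return span_prob / (1.0 + length_penalty * max(0, int(end_idx - start_idx)))
```

You could use this in place of the summed-logit `confidence` inside `get_answer()`: a normalized probability is easier to compare across contexts of different lengths than a raw logit sum.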
Other Techniques for Improving the Q&A Capability
You can easily extend the code above into a more sophisticated Q&A system, for example one that caches results or processes multiple questions in a batch.
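As an example of the latter, here is a minimal sketch of batching several questions against the same context; the toy questions and the padding settings are illustrative choices rather than part of the examples above:

```python
import torch
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering

model_name = "distilbert-base-uncased-distilled-squad"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForQuestionAnswering.from_pretrained(model_name)

context = "Paris is the capital and most populous city of France."
questions = ["What is the capital of France?", "Which city is the most populous in France?"]

# Tokenize all question-context pairs at once; padding makes the batch rectangular
inputs = tokenizer(questions, [context] * len(questions),
                   return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Decode one answer per row of the batch
for i, question in enumerate(questions):
    start = torch.argmax(outputs.start_logits[i])
    end = torch.argmax(outputs.end_logits[i])
    answer = tokenizer.decode(inputs.input_ids[i, start:end + 1])
    print(f"{question} -> {answer}")
```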
One limitation of the model used in the example above is that you cannot feed it a very long context: its maximum sequence length is 512 tokens. If your context is longer than this, you'll need to split it into smaller chunks.
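You can check this limit programmatically. A tiny sketch, which should print 512 for this checkpoint:

```python
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-distilled-squad")
# The tokenizer records the maximum sequence length the model was trained with
print(tokenizer.model_max_length)
```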
You can create chunks naively by splitting the context every 512 tokens, but you risk breaking the answer in the middle of a sentence. Another approach is to use a sliding window. Below is an implementation:
```python
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
import torch
import numpy as np

# Load pre-trained model and tokenizer
model_name = "distilbert-base-uncased-distilled-squad"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForQuestionAnswering.from_pretrained(model_name)

# Define a long context
question = "What is the capital of France?"
long_context = """Paris is the capital and most populous city of France, with an estimated
population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres.
The City of Paris is the centre and seat of government of the region and province of
Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about
18 percent of the population of France as of 2017."""

def get_answer_sliding_window(question, context, total_len=512, stride=128):
    """Function to get answer using sliding window"""
    # Tokenize the question and context
    question_tokens = tokenizer.tokenize(question)
    context_tokens = tokenizer.tokenize(context)

    # If the context is short enough, process it directly
    if len(question_tokens) + len(context_tokens) + 3 <= total_len:  # +3 for [CLS], [SEP], [SEP]
        best_answer, best_score = get_answer(question, context)
        return best_answer, best_score, context

    # Otherwise, use sliding window
    max_question_len = 64  # Limit question length to ensure we have enough space for context
    if len(question_tokens) > max_question_len:
        question_tokens = question_tokens[:max_question_len]

    # Calculate how many tokens we can allocate to the context
    max_len = total_len - len(question_tokens) - 3  # -3 for [CLS], [SEP], [SEP]
    windows = []
    for i in range(0, len(context_tokens), stride):
        windows.append(tokenizer.convert_tokens_to_string(context_tokens[i:i+max_len]))
        if i + max_len >= len(context_tokens):
            break  # Last window

    # Get answers from all windows
    answers_with_scores = [get_answer(question, window) for window in windows]

    # Find the answer with the highest confidence score
    best_answer_idx = np.argmax([score for _, score in answers_with_scores])
    best_answer, best_score = answers_with_scores[best_answer_idx]
    return best_answer, best_score, windows[best_answer_idx]

def get_answer(question, context):
    """Function to get answer from a single context"""
    inputs = tokenizer(question, context, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    answer_start = torch.argmax(outputs.start_logits)
    answer_end = torch.argmax(outputs.end_logits)

    confidence = float(outputs.start_logits[0, answer_start] + outputs.end_logits[0, answer_end])
    answer_tokens = inputs.input_ids[0, answer_start:answer_end + 1]
    answer = tokenizer.decode(answer_tokens)
    return answer, confidence

# Get answer using sliding window
best_answer, best_score, best_window = get_answer_sliding_window(question, long_context)

print(f"Question: {question}")
print(f"Best Answer: {best_answer}")
print(f"From Window: {best_window[:100]}...")
print(f"Confidence Score: {best_score}")
```
This code implements the function `get_answer_sliding_window()`, which splits the context into shorter pieces if it is too long. Each piece keeps the combined length of the question and the context within the maximum number of tokens.

The split is done with a sliding window: each step moves the window forward by `stride` tokens, which defaults to 128. In other words, each subsequent window discards 128 tokens from the left and adds 128 tokens on the right. Given the total length of 512, there is significant overlap between the windows, so the answer should appear unfragmented in at least one of them.

An answer is then found by running the model on each window as the context, and the best answer is reported based on the confidence score. This way, the context can be of arbitrary length, thanks to the tokenizer being available as an independent object that you can use to encode and decode text.
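As an aside, the Hugging Face fast tokenizers can build overlapping windows for you through the `stride` and `return_overflowing_tokens` arguments. Below is a minimal sketch of that alternative; the `max_length` value and the repeated toy context are illustrative choices only:

```python
import torch
from transformers import DistilBertTokenizerFast, DistilBertForQuestionAnswering

model_name = "distilbert-base-uncased-distilled-squad"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
model = DistilBertForQuestionAnswering.from_pretrained(model_name)

question = "What is the capital of France?"
long_context = "Paris is the capital and most populous city of France. " * 50

# Truncate only the context and emit one row per overlapping window
inputs = tokenizer(question, long_context, max_length=384, stride=128,
                   truncation="only_second", return_overflowing_tokens=True,
                   padding=True, return_tensors="pt")
inputs.pop("overflow_to_sample_mapping", None)  # not a model input

with torch.no_grad():
    outputs = model(**inputs)

# Pick the window whose best span has the highest summed logits
# (for simplicity, padding positions are not masked out here)
best = None
for i in range(inputs.input_ids.shape[0]):
    start = torch.argmax(outputs.start_logits[i])
    end = torch.argmax(outputs.end_logits[i])
    score = float(outputs.start_logits[i, start] + outputs.end_logits[i, end])
    if best is None or score > best[0]:
        best = (score, tokenizer.decode(inputs.input_ids[i, start:end + 1]))

print(best[1])
```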
Another way to improve the Q&A capability is to use not one but multiple models. This is the idea of ensemble methods. One simple approach is to run the question twice, each time with a different model, and pick the answer with the highest score. An example is shown below, which uses the original BERT model as the second model:
```python
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering, \
    BertTokenizer, BertForQuestionAnswering
import torch

# Load DistilBERT model and tokenizer
distilbert_model_name = "distilbert-base-uncased-distilled-squad"
distilbert_tokenizer = DistilBertTokenizer.from_pretrained(distilbert_model_name)
distilbert_model = DistilBertForQuestionAnswering.from_pretrained(distilbert_model_name)

# Load BERT model and tokenizer
bert_model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
bert_tokenizer = BertTokenizer.from_pretrained(bert_model_name)
bert_model = BertForQuestionAnswering.from_pretrained(bert_model_name)

# Define a context and a question
question = "What is the capital of France?"
context = """Paris is the capital and most populous city of France, with an estimated
population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres.
The City of Paris is the centre and seat of government of the region and province of
Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about
18 percent of the population of France as of 2017."""

# Function to get answer from DistilBERT
def get_distilbert_answer(question, context):
    inputs = distilbert_tokenizer(question, context, return_tensors="pt")
    with torch.no_grad():
        outputs = distilbert_model(**inputs)

    start = torch.argmax(outputs.start_logits)
    end = torch.argmax(outputs.end_logits)

    confidence = float(outputs.start_logits[0, start] + outputs.end_logits[0, end])
    answer_tokens = inputs.input_ids[0, start:end+1]
    answer = distilbert_tokenizer.decode(answer_tokens)

    return answer, confidence

# Function to get answer from BERT
def get_bert_answer(question, context):
    inputs = bert_tokenizer(question, context, return_tensors="pt")
    with torch.no_grad():
        outputs = bert_model(**inputs)

    start = torch.argmax(outputs.start_logits)
    end = torch.argmax(outputs.end_logits)

    confidence = float(outputs.start_logits[0, start] + outputs.end_logits[0, end])
    answer_tokens = inputs.input_ids[0, start:end+1]
    answer = bert_tokenizer.decode(answer_tokens)

    return answer, confidence

# Get answers from both models
distilbert_answer, distilbert_confidence = get_distilbert_answer(question, context)
bert_answer, bert_confidence = get_bert_answer(question, context)

# Simple ensemble: choose the answer with the highest confidence
if distilbert_confidence > bert_confidence:
    final_answer = distilbert_answer
    model_used = "DistilBERT"
    confidence = distilbert_confidence
else:
    final_answer = bert_answer
    model_used = "BERT"
    confidence = bert_confidence

print(f"Question: {question}")
print(f"Final Answer: {final_answer}")
print(f"Model Used: {model_used}")
print(f"Confidence Score: {confidence}")
```
This code instantiates two tokenizers and two models: one pair for DistilBERT and the other for BERT. The functions `get_distilbert_answer()` and `get_bert_answer()` obtain the answer from the respective model. Both functions are invoked with the provided question and context, and the final answer is the one with the higher confidence score.
Ensemble methods can improve accuracy by leveraging the strengths of different models and mitigating their individual weaknesses. The above is just one way to combine models; there are others. For example, with more models, you can use voting to choose the most frequently occurring answer. You can also assign a weight to each model and take a weighted average of their outputs, from which the answer is derived. A more complicated approach is stacking, where you train a meta-model to combine the predictions of the base models; this generalizes the weighted-average approach at the cost of extra computational complexity.
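For illustration, here is a minimal voting sketch over answer strings, assuming you already have per-model `(answer, confidence)` pairs such as those returned by the functions above; `vote_on_answers` is a hypothetical helper, not part of any library:

```python
from collections import Counter

def vote_on_answers(answers_with_scores):
    """Pick the most frequently returned answer string; break ties by the
    highest confidence attached to that answer (a simple heuristic)."""
    counts = Counter(answer.strip().lower() for answer, _ in answers_with_scores)
    top_count = max(counts.values())
    candidates = {a for a, c in counts.items() if c == top_count}
    best = max((pair for pair in answers_with_scores
                if pair[0].strip().lower() in candidates),
               key=lambda pair: pair[1])
    return best[0]

# Example usage with hypothetical outputs from three models
print(vote_on_answers([("Paris", 9.1), ("Paris", 7.4), ("paris, france", 11.0)]))
```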
Further Readings
Below are some further readings that you may find useful:
Summary
In this post, you have explored how to use DistilBERT for advanced question-answering tasks. In particular, you have learned:
- How to use DistilBERT’s tokenizer and the Q&A model directly
- How to interpret the Q&A model’s output and extract the answer
- How you can make use of the model’s raw output to build a more sophisticated Q&A system