Tips for measuring text-based generative models using BLEU, ROUGE, METEOR, and BERTScore, as well as prediction embeddings
In recent years, text-based generative AI models have been making significant strides in natural language processing tasks such as language translation, text summarization, and dialogue generation. These models are capable of generating text that is often indistinguishable from human-generated text, making them increasingly popular in various industries, including customer service, content generation, and data analysis. While these models can be incredibly powerful and useful, they can also produce unexpected or even harmful output, making it critical to monitor them closely.

For example, consider a chatbot that is designed to help customers with their queries. If the model is not monitored, it could generate inappropriate or unhelpful responses, damaging the reputation of the company that deployed it. Therefore, it is essential to monitor these models’ performance regularly to ensure that they are producing accurate and unbiased results. In this article, we take a deep dive into how to monitor text-based generative models using performance metrics such as BLEU, ROUGE, METEOR, and BERTScore, as well as prediction embeddings.
Monitoring Generative Models with Reference Text
To evaluate the performance of machine-generated text, a reference text, or ground truth, is used for comparison. The reference text is what the generative model would ideally produce, and it is usually collected from human domain experts. When a reference text is available for the text the model generates, several metrics can be used to compute performance. Let’s walk through the main performance metrics for generative models with examples in Python.
BLEU Score: Bilingual Evaluation Understudy
BLEU is a precision-focused metric that measures the n-gram overlap between the generated text and the reference text. The score also includes a brevity penalty, which is applied when the machine-generated text is shorter than the reference text. BLEU is most commonly used to evaluate machine translation. The score ranges from 0 to 1, with higher scores indicating greater similarity between the generated text and the reference text.
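To make the definition concrete, here is a minimal sketch of single-reference BLEU written from scratch. It assumes uniform weights over 1- to 4-grams and no smoothing; the helper names `ngrams` and `simple_bleu` are ours for illustration and are not part of any library.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(reference, candidate, max_n=4):
    """Single-reference BLEU with uniform weights and no smoothing (illustrative only)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by how often it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, a zero precision yields a score of 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: only candidates shorter than the reference are penalized.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

reference = ['this', 'movie', 'was', 'awesome']
candidate = ['this', 'movie', 'was', 'awesome', 'too']
bleu = simple_bleu(reference, candidate)
print(round(bleu, 4))  # 0.6687
```

In this example the n-gram precisions are 4/5, 3/4, 2/3, and 1/2, and no brevity penalty applies because the candidate is longer than the reference.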

    from nltk.translate.bleu_score import sentence_bleu
    # sentence_bleu expects a list of tokenized references and a single tokenized candidate
    reference = [['this', 'movie', 'was', 'awesome']]
    candidate = ['this', 'movie', 'was', 'awesome', 'too']
    score = sentence_bleu(reference, candidate)
    print(score)
    0.668740304976422
ROUGE Score: Recall-Oriented Understudy for Gisting Evaluation
ROUGE is a metric that measures the overlap between the generated text and the reference text in terms of recall. ROUGE comes in three main types: ROUGE-N, the most prevalent form, which measures n-gram overlap; ROUGE-L, which identifies the longest common subsequence; and ROUGE-S, which concentrates on skip-bigrams. ROUGE-N is the most frequently used type and is defined as:

ROUGE-N = (number of n-grams shared by the generated text and the reference text) / (total number of n-grams in the reference text)

The following code demonstrates how to calculate the ROUGE-2 score in Python:
    from rouge import Rouge
    reference = 'this movie was awesome'
    candidate = 'this movie was awesome too'
    rouge = Rouge()
    # get_scores returns one dict per (candidate, reference) pair; select the ROUGE-2 F-score
    score = rouge.get_scores(candidate, reference)[0]['rouge-2']['f']
    print(score)
    0.8571428522448981
The main difference between ROUGE and BLEU is that BLEU is precision-focused, whereas ROUGE focuses on recall.
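To make the precision/recall distinction concrete, the sketch below computes ROUGE-2 recall, precision, and F1 by hand for the same pair of sentences. It assumes simple whitespace tokenization, and the function names `bigrams` and `rouge_2` are ours for illustration rather than part of the rouge package.

```python
from collections import Counter

def bigrams(text):
    """Count the bigrams in a whitespace-tokenized string."""
    tokens = text.split()
    return Counter(zip(tokens, tokens[1:]))

def rouge_2(reference, candidate):
    """Return ROUGE-2 recall, precision, and F1 (illustrative only)."""
    ref, cand = bigrams(reference), bigrams(candidate)
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((ref & cand).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1

recall, precision, f1 = rouge_2('this movie was awesome', 'this movie was awesome too')
print(recall)        # 1.0
print(precision)     # 0.75
print(round(f1, 4))  # 0.8571
```

All three reference bigrams appear in the candidate, so recall is perfect; the extra bigram ('awesome', 'too') only lowers precision.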
METEOR Score: Metric for Evaluation of Translation with Explicit Ordering
METEOR is a metric that measures the quality of generated text based on the alignment between the generated text and the reference text. The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. You can check out the algorithm behind METEOR here.
The following code demonstrates how to calculate the METEOR score using the NLTK library in Python:
    from nltk.translate.meteor_score import single_meteor_score
    # Requires the WordNet corpus: run nltk.download('wordnet') once beforehand
    reference = ['this', 'movie', 'was', 'awesome']
    candidate = ['this', 'movie', 'was', 'awesome', 'too']
    score = single_meteor_score(reference, candidate)
    print(score)
    0.9679878048780488
BLEU is precision-focused and ROUGE is recall-focused; METEOR, by contrast, was designed to address some of the problems found in the more popular BLEU and ROUGE metrics and to correlate well with human judgment at the sentence or segment level.
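As a rough sketch of how the score is assembled, the code below recomputes METEOR by hand for the same example. It is a simplification: it only handles exact word matches (no stemming or synonyms), assumes no word is repeated in either sentence, and uses the parameters alpha=0.9, beta=3, and gamma=0.5, which to our understanding mirror NLTK's defaults. The function name `simple_meteor` is ours.

```python
def simple_meteor(reference, candidate, alpha=0.9, beta=3.0, gamma=0.5):
    """Exact-match-only METEOR sketch; assumes no repeated words in either sequence."""
    # Align each candidate token with the identical word in the reference, if any.
    ref_pos = {word: i for i, word in enumerate(reference)}
    matches = [(i, ref_pos[w]) for i, w in enumerate(candidate) if w in ref_pos]
    if not matches:
        return 0.0
    precision = len(matches) / len(candidate)
    recall = len(matches) / len(reference)
    # Weighted harmonic mean; alpha close to 1 weights recall more heavily.
    fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # A chunk is a run of matches that are adjacent in both candidate and reference.
    chunks = 1
    for (c_prev, r_prev), (c_cur, r_cur) in zip(matches, matches[1:]):
        if c_cur != c_prev + 1 or r_cur != r_prev + 1:
            chunks += 1
    # The fragmentation penalty grows as the matches split into more chunks.
    penalty = gamma * (chunks / len(matches)) ** beta
    return (1 - penalty) * fmean

reference = ['this', 'movie', 'was', 'awesome']
candidate = ['this', 'movie', 'was', 'awesome', 'too']
meteor = simple_meteor(reference, candidate)
print(round(meteor, 6))  # 0.967988
```

Here precision is 4/5 and recall is 1, and the four matched words form a single chunk, so the fragmentation penalty is small and the result is about 0.968.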
BERTScore
One main disadvantage of metrics such as BLEU or ROUGE is that they depend on exact matches between the generated text and the reference text. Exact matches might be important for use cases like machine translation; however, for generative AI models that try to create meaningful text similar to their corpus data, exact matches might not capture quality well.

Hence, instead of exact matches, BERTScore focuses on the similarity between the reference and generated text by using contextual embeddings. The main idea behind contextual embeddings is to capture the meaning of the reference and candidate texts and then compare those meanings.
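To illustrate the comparison step, here is a minimal sketch of the greedy cosine-similarity matching that BERTScore applies to token embeddings. The two-dimensional vectors below are made up for illustration and do not come from a real BERT model, and the sketch leaves out refinements such as IDF weighting and baseline rescaling. The function names are ours.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def greedy_match_scores(ref_vecs, cand_vecs):
    """Greedy-matching precision, recall, and F1 over token embeddings (illustrative only)."""
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical token embeddings: one row per token.
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]
cand_vecs = [[1.0, 0.0], [1.0, 1.0]]
p, r, f = greedy_match_scores(ref_vecs, cand_vecs)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.8536 0.8536 0.8536
```

Because tokens are matched by similarity rather than identity, a candidate token that is close in meaning to a reference token still earns partial credit, which exact-match metrics cannot give.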
The following code demonstrates how to calculate the BERTScore using the bert_score library in Python:
    import torch
    from bert_score import score
    # reference and generated texts
    ref_text = "The quick brown fox jumps over the lazy dog."
    gen_text = "A fast brown fox leaps over a lazy hound."
    # compute BERTScore precision, recall, and F1 (candidates first, then references)
    P, R, F1 = score([gen_text], [ref_text], lang="en", model_type="bert-base-uncased")
    # print results
    print(f"BERTScore: P={P.item():.4f} R={R.item():.4f} F1={F1.item():.4f}")
Monitoring Generative Models without Reference Text
When generative models produce text without any reference, monitoring them can be challenging, since performance metrics such as ROUGE or METEOR cannot be computed. However, just as with non-generative models, proxy metrics such as drift can be used to monitor generative models. In this case, since the models’ outputs are text, text embeddings can be leveraged to track the change in predictions. These embeddings provide a representation of the model’s output and can be used to compare different outputs to identify changes in the model’s behavior over time.
Specifically, the Euclidean distance between prediction embeddings can be computed to track model change over time. However, tracking model drift alone might not be enough to improve model performance and keep model behavior consistent. As an additional step, the computed prediction embeddings can be visualized in a lower-dimensional space, where predictions inside the same cluster suggest similar semantic meaning. If there are outlier points in the lower-dimensional space, those points can be analyzed and might even be used for retraining. Finally, with embedding visualizations, machine learning engineers can also block certain clusters of words so that generative models are not biased.
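One possible way to turn this into a single drift number, sketched below, is to compare the centroid of a baseline set of prediction embeddings with the centroid of a more recent set. This is only one of several reasonable choices; the embeddings, the function names, and the `DRIFT_THRESHOLD` value are hypothetical and chosen purely for illustration.

```python
import math

def centroid(vectors):
    """Average a list of equal-length embedding vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

def embedding_drift(baseline, current):
    """Euclidean distance between the centroids of two sets of embeddings."""
    return math.dist(centroid(baseline), centroid(current))

# Hypothetical 2-D prediction embeddings from two time windows.
baseline = [[0.0, 0.0], [2.0, 0.0]]  # centroid is (1, 0)
current = [[3.0, 4.0], [5.0, 4.0]]   # centroid is (4, 4)

DRIFT_THRESHOLD = 1.0  # hypothetical alerting threshold; tune per use case
drift = embedding_drift(baseline, current)
print(drift)  # 5.0
if drift > DRIFT_THRESHOLD:
    print("Prediction embeddings have drifted; inspect recent outputs.")
```

In practice the vectors would come from an embedding model and have hundreds of dimensions, and a threshold would typically be calibrated against the drift observed between windows that are known to be healthy.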
To demonstrate the use of prediction embeddings, let’s consider an example of a language model trained on a dataset of news articles. Suppose we have a model that produces an article about politics, and we want to compare its output to another article produced by the same model six months later.
First, we can use the transformers library to tokenize the two articles:
    import numpy as np
    import torch
    from transformers import AutoTokenizer, AutoModel
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    # AutoModel loads the base GPT-2 model, whose output exposes the hidden states directly
    model = AutoModel.from_pretrained("gpt2")
    article1 = "The government announced new policies on climate change."
    article2 = "The government released a statement on their new policies for combating climate change."
    inputs1 = tokenizer.encode(article1, return_tensors="pt")
    inputs2 = tokenizer.encode(article2, return_tensors="pt")
Next, we can generate the embeddings for the two articles using the model’s output:
    with torch.no_grad():
        # hidden_states[-1] is the last hidden state, with shape (1, num_tokens, hidden_size)
        outputs1 = model(inputs1, output_hidden_states=True)
        embeddings1 = outputs1.hidden_states[-1].mean(dim=1).squeeze().numpy()
        outputs2 = model(inputs2, output_hidden_states=True)
        embeddings2 = outputs2.hidden_states[-1].mean(dim=1).squeeze().numpy()
Here, the mean of the model’s last hidden state over all tokens serves as the embedding for each article. Finally, we can calculate the Euclidean distance between the two embeddings to compare the two articles:
    from scipy.spatial.distance import euclidean
    distance = euclidean(embeddings1, embeddings2)
    print(distance)
Next Steps of Generative Model Monitoring
Even though traditional metrics such as BLEU, ROUGE, or METEOR are useful for monitoring model performance, evaluation based on large language models (LLMs) such as BERT is expected to become more common in the next few years. Using LLMs to evaluate LLMs on complex tasks is an emerging area of research that aims to enhance the performance of language models. LLM-based evaluation can be advantageous because LLMs can capture complex patterns and dependencies in text that traditional evaluation metrics may overlook. Additionally, LLMs can be trained on a wide range of tasks, which can aid in the evaluation of other LLMs across multiple domains.
Conclusion
In conclusion, monitoring text-based generative AI models is crucial for maintaining their performance and fairness over time. Using performance metrics such as BLEU, ROUGE, METEOR, and BERTScore, we can evaluate the quality of the model’s output and track changes in its behavior. Additionally, prediction embeddings are a valuable tool for detecting drift when no reference text is available, which can help improve the model’s accuracy and fairness. However, there are limitations to generative AI model monitoring, and additional measures such as diverse training data and regular retraining may be necessary to maintain model performance. Overall, by incorporating monitoring techniques and best practices, we can support the continued success of text-based generative AI models in a variety of applications, from chatbots to content generation and beyond.

