ReALM: Reference Resolution with Language Modelling

Apr 3, 2024

Understanding context is crucial for any conversational AI system. This includes not only the flow of conversation but also elements present on the user’s screen or running in the background. While large language models (LLMs) have shown impressive capabilities in various tasks, their application in reference resolution, particularly for non-conversational entities, remains underexplored.

This post looks at ReALM, an approach that uses LLMs to resolve references of various types, including on-screen and conversational entities. ReALM tackles the challenge of encoding non-textual entities as text that an LLM can process, and in the reported results it outperforms the MARRS baseline on every dataset.

Why ReALM?

Traditional reference resolution systems often struggle with on-screen references, which are crucial for hands-free interaction with voice assistants. Existing methods also rely on hand-crafted rules and struggle to scale to new entity types. ReALM addresses these limitations by:

Utilizing LLMs: ReALM leverages the power of LLMs to understand complex contexts and semantic relationships.
Encoding On-Screen Entities: ReALM converts on-screen entities and their locations into a textual representation, allowing LLMs to “see” the screen context.
Joint Resolution: ReALM effectively handles both conversational and on-screen references, providing a unified solution for context understanding.

How ReALM Works

ReALM employs a pipeline approach where an LLM is fine-tuned specifically for reference resolution. The input data, consisting of user queries and corresponding entities, is converted into a sentence-wise format suitable for LLM training.

ReALM Benchmarks and Architecture Details

Benchmarking ReALM

ReALM was evaluated on three main datasets:

Conversational: This dataset contains user queries and relevant entities collected through a process where graders were shown screenshots with synthetic lists of entities and asked to provide queries that unambiguously reference specific entities.
Synthetic: This dataset was generated using templates to create type-based references where the user query and entity type are sufficient to resolve the reference.
On-screen: This dataset was collected from web pages containing phone numbers, email addresses, and physical addresses. Annotators provided queries based on the screens and identified the entities referred to in each query.

The performance of ReALM was compared to two baseline models:

MARRS: A non-LLM-based reference resolution system.
ChatGPT (GPT-3.5 and GPT-4): Large language models with in-context learning capabilities.

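The results are summarized in the following table. Values are accuracy, and the Unseen column reports results on a domain that was not seen during training: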

Model	Conversational	Synthetic	On-screen	Unseen
MARRS	92.1%	99.4%	83.5%	84.5%
GPT-3.5	84.1%	34.2%	74.1%	67.5%
GPT-4	97.0%	58.7%	90.1%	98.4%
ReALM-80M	96.7%	99.5%	88.9%	99.3%
ReALM-250M	97.8%	99.8%	90.6%	97.2%
ReALM-1B	97.9%	99.7%	91.4%	94.8%
ReALM-3B	97.9%	99.8%	93.0%	97.8%

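As shown, every ReALM variant outperforms MARRS in all four columns, and the variants from 250M parameters upward match or exceed GPT-4 on the Conversational, Synthetic, and On-screen datasets despite being much smaller models. On the Unseen column, only ReALM-80M (99.3%) scores above GPT-4 (98.4%). The gain is most visible on the on-screen dataset, where accuracy rises from 83.5% for MARRS to between 88.9% and 93.0% for ReALM, demonstrating its effectiveness in resolving references to visual elements.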
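To make the comparison concrete, the short sketch below recomputes absolute accuracy differences from the table. The numbers are copied from the table above; the dictionary layout and the helper name `gain_over` are illustrative choices, not part of ReALM.

```python
# Accuracy (%) copied from the results table; the column order is fixed here.
COLUMNS = ("Conversational", "Synthetic", "On-screen", "Unseen")

RESULTS = {
    "MARRS": (92.1, 99.4, 83.5, 84.5),
    "GPT-3.5": (84.1, 34.2, 74.1, 67.5),
    "GPT-4": (97.0, 58.7, 90.1, 98.4),
    "ReALM-80M": (96.7, 99.5, 88.9, 99.3),
    "ReALM-250M": (97.8, 99.8, 90.6, 97.2),
    "ReALM-1B": (97.9, 99.7, 91.4, 94.8),
    "ReALM-3B": (97.9, 99.8, 93.0, 97.8),
}


def gain_over(model, baseline):
    """Absolute accuracy difference in percentage points, per column."""
    return {
        col: round(m - b, 1)
        for col, m, b in zip(COLUMNS, RESULTS[model], RESULTS[baseline])
    }


if __name__ == "__main__":
    for model in ("ReALM-80M", "ReALM-3B"):
        print(model, "vs MARRS:", gain_over(model, "MARRS"))
        print(model, "vs GPT-4:", gain_over(model, "GPT-4"))
```

A positive value means the first model scores higher than the baseline in that column; a negative value means it scores lower.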

Architecture Details

ReALM utilizes a FLAN-T5 model, a type of Transformer-based language model, which is fine-tuned specifically for the task of reference resolution. The input to the model consists of the user query and the encoded representations of the candidate entities.

The encoding process differs for conversational and on-screen entities:

Conversational entities: These are encoded by combining their type and relevant properties, such as name, address, or time.
On-screen entities: These are encoded using a novel algorithm that converts their locations and surrounding text elements into a textual representation that preserves the spatial layout of the screen.

The model is then trained to predict the relevant entity (or entities) based on the user query and the encoded entity information.
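A minimal sketch of how a query and its encoded candidates might be flattened into a text-to-text training pair for a model such as FLAN-T5. The prompt wording, the numbering scheme, and the function name `build_example` are assumptions made for illustration, not the format used by the ReALM authors.

```python
from typing import List, Sequence, Tuple


def build_example(
    query: str,
    encoded_entities: Sequence[str],
    relevant: Sequence[int],
) -> Tuple[str, str]:
    """Return (input_text, target_text) for a sequence-to-sequence model.

    `encoded_entities` are entity descriptions that are already text, and
    `relevant` holds the 0-based positions of the entities the query refers to.
    The layout produced here is hypothetical.
    """
    lines: List[str] = [f"Query: {query}", "Entities:"]
    for idx, entity in enumerate(encoded_entities, start=1):
        lines.append(f"{idx}. {entity}")
    input_text = "\n".join(lines)
    # The target lists the 1-based numbers of the referenced entities,
    # or the word "none" when the query refers to no candidate.
    target_text = ", ".join(str(i + 1) for i in sorted(relevant)) or "none"
    return input_text, target_text
```

Casting the task as choosing among numbered candidates keeps the target short and lets one model handle queries that refer to zero, one, or several entities.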

Encoding Conversational References

ReALM handles two types of conversational references:

Type-based: These references rely on the user query and entity types to identify relevant entities. For example, “play this” implies a song or movie, while “call him” refers to a phone number or contact.
Descriptive: These references use specific properties of the entity to identify it, such as “the one in Times Square.”

ReALM encodes both the type and properties of each entity, allowing the LLM to learn the relationships between them and user queries.
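As a rough illustration of combining type and properties, the sketch below serializes one entity into a single line of text. The `key: value` layout, the `|` separator, and the function name `encode_conversational_entity` are hypothetical; the sample business is made-up data.

```python
from typing import Mapping


def encode_conversational_entity(entity_type: str, properties: Mapping[str, str]) -> str:
    """Serialize an entity as 'type: X | key: value | ...' (hypothetical format)."""
    parts = [f"type: {entity_type}"]
    # Skip empty values so that missing properties do not add noise.
    parts.extend(f"{key}: {value}" for key, value in properties.items() if value)
    return " | ".join(parts)


if __name__ == "__main__":
    # Made-up example entity, echoing the "the one in Times Square" reference.
    print(encode_conversational_entity(
        "business",
        {"name": "Example Pizza", "address": "Times Square, New York"},
    ))
```

With the type in the text, a query like "call him" can be matched against phone or contact entities, while the properties give descriptive references such as "the one in Times Square" something to match against.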

Encoding On-Screen References

ReALM encodes on-screen entities into text with a purpose-built algorithm. The algorithm assumes that the location of each entity and surrounding object can be represented by the center of its bounding box. These centers are sorted top to bottom and then left to right, so the resulting text reads in the same order as the screen layout. This textual representation lets the LLM reason about the spatial relationships between entities on the screen.
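The sorting idea described above can be sketched as follows. This is an approximation under stated assumptions, not the authors' implementation: the `ScreenElement` class, the `line_margin` threshold for deciding that two elements share a row, and the use of tabs within a row and newlines between rows are all illustrative choices.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ScreenElement:
    """A piece of on-screen text with its bounding box (hypothetical type)."""
    text: str
    left: float
    top: float
    right: float
    bottom: float

    @property
    def center(self) -> Tuple[float, float]:
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)


def encode_screen(elements: List[ScreenElement], line_margin: float = 10.0) -> str:
    """Render elements as text, top to bottom and left to right."""
    # Sort by the vertical position of each bounding-box center.
    ordered = sorted(elements, key=lambda e: e.center[1])
    rows: List[List[ScreenElement]] = []
    for el in ordered:
        # Assumption: an element joins the current row when its center is
        # within `line_margin` of the center of that row's first element.
        if rows and abs(el.center[1] - rows[-1][0].center[1]) <= line_margin:
            rows[-1].append(el)
        else:
            rows.append([el])
    # Within each row, order left to right; join rows with newlines.
    return "\n".join(
        "\t".join(e.text for e in sorted(row, key=lambda e: e.center[0]))
        for row in rows
    )


if __name__ == "__main__":
    screen = [
        ScreenElement("555-0100", 200, 100, 300, 120),
        ScreenElement("Phone:", 10, 102, 90, 122),
        ScreenElement("Contact us", 10, 10, 200, 40),
    ]
    print(encode_screen(screen))
```

Grouping by a margin rather than by exact coordinates keeps elements that are visually on one line together even when their boxes are slightly misaligned.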

Future Directions

While ReALM effectively encodes the positions of on-screen entities, it can be further improved by exploring more complex spatial representations. Additionally, investigating methods to combine textual and visual information directly within the LLM framework could lead to even more robust reference resolution capabilities.

ReALM demonstrates the potential of LLMs in reference resolution, paving the way for more natural and intuitive interactions with conversational AI systems. By bridging the gap between textual and non-textual entities, ReALM opens doors to exciting possibilities in the field of context understanding and human-machine interaction.
