12 RAG Pain Points and Solutions

  • Pain Point 1: Missing Content
  • Pain Point 2: Missed the Top Ranked Documents
  • Pain Point 3: Not in Context — Consolidation Strategy Limitations
  • Pain Point 4: Not Extracted
  • Pain Point 5: Wrong Format
  • Pain Point 6: Incorrect Specificity
  • Pain Point 7: Incomplete
  • Pain Point 8: Data Ingestion Scalability
  • Pain Point 9: Structured Data QA
  • Pain Point 10: Data Extraction from Complex PDFs
  • Pain Point 11: Fallback Model(s)
  • Pain Point 12: LLM Security

Inspired by the paper “Seven Failure Points When Engineering a Retrieval Augmented Generation System” by Barnett et al., this article explores the seven pain points discussed in the paper, plus five additional common pain points in building a RAG pipeline. For each, we examine the problem and its proposed solutions so we can address them effectively in our day-to-day RAG development.

I use the term “pain points” instead of “failure points” as these issues all have corresponding suggested remedies. Let’s attempt to resolve these pain points before they become failures within our RAG pipelines.

First, let’s examine the seven pain points presented in Barnett et al.’s paper. We will then cover the five additional pain points and their suggested solutions.

Pain Point 1: Missing Content

Context missing in the knowledge base. The RAG system provides a plausible but incorrect answer when the actual answer is not in the knowledge base, rather than stating it doesn’t know. Users receive misleading information, leading to frustration.

We have two proposed solutions:

Clean your data

The RAG system may generate plausible but incorrect answers when the actual answer is not in the knowledge base, leading to misinformation for users. To address this issue, it’s crucial to focus on data quality before building your RAG pipeline. This advice is not limited to the pain point discussed here; it applies to all of them.

To clean your data:
1. Remove noise and irrelevant information: Eliminate special characters, stop words (common words like “the” and “a”), and HTML tags.
2. Identify and correct errors: Correct spelling mistakes, typos, and grammatical errors using tools like spell checkers and language models.
3. Deduplicate records: Remove duplicate or similar records that could potentially bias the retrieval process.

For more data cleaning functionality, refer to Unstructured.io and its core library, which offers a range of cleaning utilities.
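
As a minimal sketch of these steps (the cleaner functions come from unstructured’s core cleaning module; the regex and deduplication steps are illustrative and should be adapted to your own corpus):

    import re
    from unstructured.cleaners.core import clean

    def basic_clean(text: str) -> str:
        text = re.sub(r"<[^>]+>", " ", text)  # strip leftover HTML tags
        # collapse extra whitespace and remove bullet/dash artifacts
        return clean(text, extra_whitespace=True, dashes=True, bullets=True)

    raw_docs = [
        "●  Quarterly revenue <b>grew</b>   12%  ",
        "● Quarterly revenue grew 12%",
    ]
    cleaned = [basic_clean(d) for d in raw_docs]
    deduped = list(dict.fromkeys(cleaned))  # drop exact duplicates after cleaning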

Better prompting

Improve prompting to help the system provide more accurate answers in situations where it might otherwise give a plausible but incorrect response due to insufficient information in the knowledge base. Encourage the model to acknowledge its limitations and communicate uncertainty by instructing it with prompts like “Tell me you don’t know if you are not sure of the answer.” While we cannot guarantee 100% accuracy, better prompting is a valuable step after cleaning your data.
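
As a minimal sketch of such a prompt in LlamaIndex (assuming an index has already been built; the prompt wording is illustrative, and passing a text_qa_template to as_query_engine is one way to apply it):

    from llama_index.core import PromptTemplate

    # instruct the LLM to admit when the retrieved context lacks the answer
    qa_prompt = PromptTemplate(
        "Context information is below.\n"
        "---------------------\n"
        "{context_str}\n"
        "---------------------\n"
        "Answer the query using only the context above. If the context does not "
        "contain the answer, reply with 'I don't know.'\n"
        "Query: {query_str}\n"
        "Answer: "
    )

    # hypothetical usage: pass the template to a query engine built over your index
    query_engine = index.as_query_engine(text_qa_template=qa_prompt)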

Pain Point 2: Missed the Top Ranked Documents

Context missing in the initial retrieval pass. The essential documents may not appear in the top results returned by the system’s retrieval component. The correct answer is overlooked, causing the system to fail to deliver accurate responses. The paper hinted, “The answer to the question is in the document but did not rank highly enough to be returned to the user”.

Two proposed solutions came to my mind:

Hyperparameter tuning for `chunk_size` and `similarity_top_k`



The parameters chunk_size and similarity_top_k are crucial in balancing the retrieval process’s efficiency and the quality of the information retrieved in RAG models. Modifying these parameters affects the balance between computational speed and the accuracy of the information retrieved. Refer to the example code provided below.

    from llama_index.core.param_tuner import ParamTuner

    param_tuner = ParamTuner(
        param_fn=objective_function_semantic_similarity,  
        param_dict=param_dict,  
        fixed_param_dict=fixed_param_dict,  
        show_progress=True,  
    )  
      
    results = param_tuner.tune()

The function objective_function_semantic_similarity is defined as follows, with param_dict containing the parameters chunk_size and top_k and their corresponding proposed values:

    import numpy as np

    # contains the parameters that need to be tuned
    param_dict = {"chunk_size": [256, 512, 1024], "top_k": [1, 2, 5]}
      
    # contains parameters remaining fixed across all runs of the tuning process  
    fixed_param_dict = {  
        "docs": documents,  
        "eval_qs": eval_qs,  
        "ref_response_strs": ref_response_strs,  
    }  
      
    # _build_index, get_responses, _get_eval_batch_runner_semantic_similarity,
    # and RunResult are helpers/classes defined in the linked LlamaIndex notebook
    def objective_function_semantic_similarity(params_dict):
        chunk_size = params_dict["chunk_size"]  
        docs = params_dict["docs"]  
        top_k = params_dict["top_k"]  
        eval_qs = params_dict["eval_qs"]  
        ref_response_strs = params_dict["ref_response_strs"]  
      
        # build index  
        index = _build_index(chunk_size, docs)  
      
        # query engine  
        query_engine = index.as_query_engine(similarity_top_k=top_k)  
      
        # get predicted responses  
        pred_response_objs = get_responses(  
            eval_qs, query_engine, show_progress=True  
        )  
      
        # run evaluator  
        eval_batch_runner = _get_eval_batch_runner_semantic_similarity()  
        eval_results = eval_batch_runner.evaluate_responses(  
            eval_qs, responses=pred_response_objs, reference=ref_response_strs  
        )  
      
        # get semantic similarity metric  
        mean_score = np.array(  
            [r.score for r in eval_results["semantic_similarity"]]  
        ).mean()  
      
        return RunResult(score=mean_score, params=params_dict)

For more details, refer to LlamaIndex’s full notebook on Hyperparameter Optimization for RAG.

Reranking

Reranking retrieval results before sending them to the LLM has significantly improved RAG performance. This LlamaIndex notebook demonstrates the difference between:

* Inaccurate retrieval by directly retrieving the top 2 nodes without a reranker.

* Accurate retrieval by retrieving the top 10 nodes and using `CohereRerank` to rerank and return the top 2 nodes.

    import os  
    from llama_index.postprocessor.cohere_rerank import CohereRerank  
      
    api_key = os.environ["COHERE_API_KEY"]  
    cohere_rerank = CohereRerank(api_key=api_key, top_n=2) # return top 2 nodes from reranker  
      
    query_engine = index.as_query_engine(  
        similarity_top_k=10, # we can set a high top_k here to ensure maximum relevant retrieval  
        node_postprocessors=[cohere_rerank], # pass the reranker to node_postprocessors  
    )  
      
    response = query_engine.query(  
        "What did Sam Altman do in this essay?",  
    )

In addition, you can evaluate and enhance retriever performance using various embeddings and rerankers, as detailed in Boosting RAG: Picking the Best Embedding & Reranker models.

Moreover, you can finetune a custom reranker to get even better retrieval performance; the detailed implementation is documented in Improving Retrieval Performance by Fine-tuning Cohere Reranker with LlamaIndex.

Pain Point 3: Not in Context — Consolidation Strategy Limitations

Context missing after reranking. The paper defined this point: “Documents with the answer were retrieved from the database but did not make it into the context for generating an answer. This occurs when many documents are returned from the database, and a consolidation process takes place to retrieve the answer”.

In addition to adding a reranker and finetuning the reranker as described in the above section, we can explore the following proposed solutions:

Tweak retrieval strategies

LlamaIndex offers an array of retrieval strategies, from basic to advanced, to help us achieve accurate retrieval in our RAG pipelines. Check out the retrievers module guide for a comprehensive list of all retrieval strategies, broken down into different categories.

* Basic retrieval from each index

* Advanced retrieval and search

* Auto-Retrieval

* Knowledge Graph Retrievers

* Composed/Hierarchical Retrievers

* and more!
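
For example, here is a minimal sketch of wiring an explicit retriever into a query engine so the retrieval strategy can be swapped out independently of the rest of the pipeline (assuming the index built in the earlier snippets):

    from llama_index.core.query_engine import RetrieverQueryEngine
    from llama_index.core.retrievers import VectorIndexRetriever

    # start from basic vector retrieval; swap in an auto-retriever, knowledge-graph
    # retriever, or composed retriever from the module guide without touching the rest
    retriever = VectorIndexRetriever(index=index, similarity_top_k=5)
    query_engine = RetrieverQueryEngine.from_args(retriever=retriever)

    response = query_engine.query("What did the author work on?")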

Finetune embeddings

If you use an open-source embedding model, finetuning your embedding model is a great way to achieve more accurate retrieval. LlamaIndex has a step-by-step guide on finetuning an open-source embedding model, showing that finetuning consistently improves metrics across the evaluation suite.

See below a sample code snippet on creating a finetune engine, running the finetuning, and getting the finetuned model:

    from llama_index.finetuning import SentenceTransformersFinetuneEngine

    finetune_engine = SentenceTransformersFinetuneEngine(
        train_dataset,
        model_id="BAAI/bge-small-en",
        model_output_path="test_model",
        val_dataset=val_dataset,
    )
      
    finetune_engine.finetune()  
      
    embed_model = finetune_engine.get_finetuned_model()

Pain Point 4: Not Extracted

Context not extracted. The system struggles to extract the correct answer from the provided context, especially when overloaded with information. Key details are missed, compromising the quality of responses. The paper hinted: “This occurs when there is too much noise or contradicting information in the context”.

Let’s explore three proposed solutions:

Clean your data

This pain point is yet another typical victim of bad data. We cannot stress enough the importance of clean data! Do spend time cleaning your data first before blaming your RAG pipeline.

Prompt Compression

Prompt compression in the long-context setting was introduced in the [LongLLMLingua research project/paper](https://arxiv.org/abs/2310.06839). With its integration into LlamaIndex, we can now implement LongLLMLingua as a node postprocessor, which compresses context after the retrieval step and before feeding it into the LLM. A LongLLMLingua-compressed prompt can achieve higher performance at much lower cost, and the entire system runs faster.

See the sample code snippet below, where we set up LongLLMLinguaPostprocessor, which uses the longllmlingua package to run prompt compression.

For more details, check out the full notebook on LongLLMLingua.

    from llama_index.core.query_engine import RetrieverQueryEngine  
    from llama_index.core.response_synthesizers import CompactAndRefine  
    from llama_index.postprocessor.longllmlingua import LongLLMLinguaPostprocessor  
    from llama_index.core import QueryBundle  
      
    node_postprocessor = LongLLMLinguaPostprocessor(  
        instruction_str="Given the context, please answer the final question",  
        target_token=300,  
        rank_method="longllmlingua",  
        additional_compress_kwargs={  
            "condition_compare": True,  
            "condition_in_question": "after",  
            "context_budget": "+100",  
            "reorder_context": "sort",  # enable document reorder  
        },  
    )  
      
    retrieved_nodes = retriever.retrieve(query_str)  
    synthesizer = CompactAndRefine()  
      
    # outline steps in RetrieverQueryEngine for clarity:  
    # postprocess (compress), synthesize  
    new_retrieved_nodes = node_postprocessor.postprocess_nodes(  
        retrieved_nodes, query_bundle=QueryBundle(query_str=query_str)  
    )  
      
    print("\n\n".join([n.get_content() for n in new_retrieved_nodes]))  
      
    response = synthesizer.synthesize(query_str, new_retrieved_nodes)

LongContextReorder

A study observed that the best performance typically arises when crucial information is positioned at the beginning or end of the input context. `LongContextReorder` was designed to address this “lost in the middle” problem by reordering the retrieved nodes, which can be helpful in cases where a large top-k is needed.

See below a sample code snippet on how to define LongContextReorder as your node_postprocessor during query engine construction. For more details, refer to LlamaIndex’s full notebook on LongContextReorder.

    from llama_index.core.postprocessor import LongContextReorder  
      
    reorder = LongContextReorder()  
      
    reorder_engine = index.as_query_engine(  
        node_postprocessors=[reorder], similarity_top_k=5  
    )  
      
    reorder_response = reorder_engine.query("Did the author meet Sam Altman?")

Pain Point 5: Wrong Format

Output is in the wrong format. When the LLM overlooks an instruction to extract information in a specific format, such as a table or a list, we have four proposed solutions to explore:

Better prompting

There are several strategies you can employ to improve your prompts and rectify this issue:

* Clarify the instructions.

* Simplify the request and use keywords.

* Give examples.

* Iterative prompting and asking follow-up questions.
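
For instance, a prompt that applies these tips (explicit format instruction, simple keywords, and a worked example) might look like the following sketch:

    # an illustrative prompt: explicit format instruction, simple keywords,
    # and a worked example of the expected output
    prompt = (
        "Extract each product name and price from the text below.\n"
        "Return ONLY a markdown table with the columns: Product | Price.\n\n"
        "Example:\n"
        "| Product | Price |\n"
        "|---------|-------|\n"
        "| Widget  | $9.99 |\n\n"
        "Text: {input_text}"
    )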

Output parsing

Output parsing can be used in the following ways to help ensure the desired output format:

  • to provide formatting instructions for any prompt/query
  • to provide “parsing” of LLM outputs

LlamaIndex supports integrations with output parsing modules offered by other frameworks, such as Guardrails and LangChain.

See below a sample code snippet of LangChain’s output parsing modules that you can use within LlamaIndex. For more details, visit LlamaIndex’s documentation on output parsing modules.


    from llama_index.core import VectorStoreIndex, SimpleDirectoryReader  
    from llama_index.core.output_parsers import LangchainOutputParser  
    from llama_index.llms.openai import OpenAI  
    from langchain.output_parsers import StructuredOutputParser, ResponseSchema  
      
    # load documents, build index  
    documents = SimpleDirectoryReader("../paul_graham_essay/data").load_data()  
    index = VectorStoreIndex.from_documents(documents)  
      
    # define output schema  
    response_schemas = [  
        ResponseSchema(  
            name="Education",  
            description="Describes the author's educational experience/background.",  
        ),  
        ResponseSchema(  
            name="Work",  
            description="Describes the author's work experience/background.",  
        ),  
    ]  
      
    # define output parser  
    lc_output_parser = StructuredOutputParser.from_response_schemas(  
        response_schemas  
    )  
    output_parser = LangchainOutputParser(lc_output_parser)  
      
    # Attach output parser to LLM  
    llm = OpenAI(output_parser=output_parser)  
      
    # obtain a structured response  
    query_engine = index.as_query_engine(llm=llm)  
    response = query_engine.query(  
        "What are a few things the author did growing up?",  
    )  
    print(str(response))

Pydantic programs

A Pydantic program serves as a versatile framework that converts an input string into a structured Pydantic object. LlamaIndex offers several categories of Pydantic programs:

  • LLM Text Completion Pydantic Programs: these programs process input text and convert it into a user-defined structured object, using a text completion API combined with output parsing.
  • LLM Function Calling Pydantic Programs: these programs convert input text into a user-specified structured object by leveraging an LLM function calling API.
  • Prepackaged Pydantic Programs: these are designed to convert input text into predefined structured objects.

See the code snippet below from the OpenAI Pydantic Program. For more details, refer to LlamaIndex’s documentation on Pydantic programs, which links to the notebooks/guides for the various Pydantic programs.


    from pydantic import BaseModel  
    from typing import List  
      
    from llama_index.program.openai import OpenAIPydanticProgram  
      
    # Define output schema (without docstring)  
    class Song(BaseModel):  
        title: str  
        length_seconds: int  
      
      
    class Album(BaseModel):  
        name: str  
        artist: str  
        songs: List[Song]  
      
    # Define openai pydantic program  
    prompt_template_str = """\  
    Generate an example album, with an artist and a list of songs. \  
    Using the movie {movie_name} as inspiration.\  
    """  
    program = OpenAIPydanticProgram.from_defaults(  
        output_cls=Album, prompt_template_str=prompt_template_str, verbose=True  
    )  
      
    # Run program to get structured output  
    output = program(  
        movie_name="The Shining", description="Data model for an album."  
    )

OpenAI JSON mode

OpenAI JSON mode allows us to set response_format to {"type": "json_object"}, enabling JSON mode for the response. When JSON mode is enabled, the model is constrained to only generate strings that parse into valid JSON objects. While JSON mode enforces the format of the output, it does not help with validation against a specified schema. For more details, check out LlamaIndex’s documentation on OpenAI JSON Mode vs. Function Calling for Data Extraction.
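
For illustration, here is a minimal sketch of enabling JSON mode with the OpenAI Python client directly (the model name is an assumption; note that JSON mode requires the word “JSON” to appear in the messages):

    from openai import OpenAI

    client = OpenAI()
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "You extract data and always reply in JSON."},
            {"role": "user", "content": "List three colors as a JSON object under the key 'colors'."},
        ],
    )
    print(completion.choices[0].message.content)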

Pain Point 6: Incorrect Specificity

Output has incorrect level of specificity. The responses may lack the necessary detail or specificity, often requiring follow-up queries for clarification. Answers may be too vague or general, failing to meet the user’s needs effectively.

We turn to advanced retrieval strategies for solutions.

Advanced retrieval strategies

When the responses lack the desired level of granularity, you can improve your retrieval strategies. Some key advanced retrieval strategies that may help resolve this pain point include:

  • Small-to-big retrieval
  • Sentence window retrieval
  • Recursive retrieval
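
As an illustration, here is a minimal sketch of sentence window retrieval (assuming documents have already been loaded as in the earlier snippets):

    from llama_index.core import VectorStoreIndex
    from llama_index.core.node_parser import SentenceWindowNodeParser
    from llama_index.core.postprocessor import MetadataReplacementPostProcessor

    # split documents into single sentences, keeping a window of neighboring
    # sentences in each node's metadata
    node_parser = SentenceWindowNodeParser.from_defaults(
        window_size=3,
        window_metadata_key="window",
        original_text_metadata_key="original_text",
    )
    nodes = node_parser.get_nodes_from_documents(documents)
    index = VectorStoreIndex(nodes)

    # at query time, swap each retrieved sentence for its surrounding window so
    # the LLM sees precise yet sufficiently detailed context
    query_engine = index.as_query_engine(
        similarity_top_k=2,
        node_postprocessors=[
            MetadataReplacementPostProcessor(target_metadata_key="window")
        ],
    )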

Pain Point 7: Incomplete

Output is incomplete. Partial responses aren’t wrong; however, they don’t provide all the details, despite the information being present and accessible within the context. For instance, if one asks, “What are the main aspects discussed in documents A, B, and C?” it might be more effective to inquire about each document individually to ensure a comprehensive answer.

Query transformations

Comparison questions often perform poorly with basic RAG methods. A good way to improve the reasoning capability of RAG is to add a query understanding layer — apply query transformations before actually querying the vector store. Here are four different query transformations:

  • Routing: retain the initial query while identifying the appropriate subset of tools it pertains to, then designate these tools as the suitable options.
  • Query-Rewriting: keep the selected tools, but reformulate the query in multiple ways to apply it across the same set of tools.
  • Sub-Questions: break down the query into several smaller questions, each targeting different tools as determined by their metadata.
  • ReAct Agent Tool Selection: based on the original query, determine which tool to use and formulate the specific query to run on that tool.

Below is a sample code snippet on using HyDE (Hypothetical Document Embeddings), a query-rewriting technique. Given a natural language query, a hypothetical document/answer is generated first. This hypothetical document is then used for embedding lookup rather than the raw query.

    from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
    from llama_index.core.indices.query.query_transform import HyDEQueryTransform
    from llama_index.core.query_engine import TransformQueryEngine

    # load documents, build index
    documents = SimpleDirectoryReader("../paul_graham_essay/data").load_data()
    index = VectorStoreIndex.from_documents(documents)

    # run query with HyDE query transform
    query_str = "what did paul graham do after going to RISD"
    hyde = HyDEQueryTransform(include_original=True)
    query_engine = index.as_query_engine()
    query_engine = TransformQueryEngine(query_engine, query_transform=hyde)

    response = query_engine.query(query_str)
    print(response)

Review LlamaIndex’s Query Transform Cookbook for comprehensive insights.

The above seven pain points are all from the paper. Next, let’s explore five additional pain points commonly encountered in RAG development, along with their proposed solutions.

Pain Point 8: Data Ingestion Scalability

The data ingestion pipeline struggles to scale to higher data volumes. A data ingestion scalability issue in a RAG pipeline refers to the challenges that arise when the system cannot efficiently process and manage large volumes of data, leading to performance bottlenecks and potential system failure. Such issues can cause prolonged ingestion times, system overload, data quality problems, and limited availability.

Parallelizing ingestion pipeline

LlamaIndex offers ingestion pipeline parallel processing, a feature that enables up to 15x faster document processing. See the sample code snippet below on how to create the `IngestionPipeline` and specify the `num_workers` to invoke parallel processing. Check out LlamaIndex’s full notebook for more details.

    from llama_index.core import SimpleDirectoryReader
    from llama_index.core.ingestion import IngestionPipeline
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.embeddings.openai import OpenAIEmbedding

    # load documents (a minimal sketch; the transformations below are illustrative)
    documents = SimpleDirectoryReader(input_dir="./data").load_data()

    # create the pipeline with transformations
    pipeline = IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=1024, chunk_overlap=20),
            OpenAIEmbedding(),
        ]
    )

    # setting num_workers to a value greater than 1 invokes parallel execution
    nodes = pipeline.run(documents=documents, num_workers=4)

Pain Point 9: Structured Data QA

Accurately interpreting user queries to retrieve relevant structured data is difficult, especially with complex or ambiguous queries, inflexible text-to-SQL conversion, and the limitations of current LLMs in handling these tasks effectively.

LlamaIndex introduces two solutions.

Chain-of-Table Pack

The ChainOfTablePack is a LlamaPack based on the “chain-of-table” paper by Wang et al. The “chain-of-table” approach integrates the chain-of-thought concept with table transformations and representations. It transforms tables step by step using a constrained set of operations, presenting the modified tables to the LLM at each step. A key advantage of this approach is its ability to handle questions involving complex table cells that contain multiple pieces of information, by methodically slicing and dicing the data until the right subsets are identified, thereby improving the accuracy of tabular QA.

For instructions on how to employ ChainOfTablePack for querying your structured data, visit LlamaIndex’s comprehensive notebook.
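
As a rough sketch (the constructor arguments below are illustrative assumptions; check the notebook for the exact signature), the pack can be downloaded and run on a pandas DataFrame like this:

    from llama_index.core.llama_pack import download_llama_pack

    # download the pack's source code locally
    ChainOfTablePack = download_llama_pack(
        "ChainOfTablePack", "./chain_of_table_pack"
    )

    # hypothetical constructor arguments -- check the pack's notebook for the
    # exact signature; `df` is a pandas DataFrame and `llm` a LlamaIndex LLM
    pack = ChainOfTablePack(df, llm=llm, verbose=True)
    response = pack.run("Who directed the film released in 1972?")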

Mix-Self-Consistency Pack

LLMs can reason over tabular data in two main ways:

  • Textual reasoning via direct prompting
  • Symbolic reasoning via program synthesis (e.g., Python, SQL, etc.)

Based on the paper Rethinking Tabular Data Understanding with Large Language Models by Liu et al., LlamaIndex developed the MixSelfConsistencyQueryEngine, which aggregates results from both textual and symbolic reasoning with a self-consistency mechanism (i.e., majority voting) to achieve state-of-the-art performance. Below is an example code snippet. For more details, explore LlamaIndex’s complete notebook.


    from llama_index.core.llama_pack import download_llama_pack

    # download the pack's source code locally; skip_load=True defers importing
    download_llama_pack(
        "MixSelfConsistencyPack",
        "./mix_self_consistency_pack",
        skip_load=True,
    )

    # MixSelfConsistencyQueryEngine is then imported from the downloaded pack
    # (see the linked notebook for the exact import path)
      
    query_engine = MixSelfConsistencyQueryEngine(  
        df=table,  
        llm=llm,  
        text_paths=5, # sampling 5 textual reasoning paths  
        symbolic_paths=5, # sampling 5 symbolic reasoning paths  
        aggregation_mode="self-consistency", # aggregates results across both text and symbolic paths via self-consistency (i.e. majority voting)  
        verbose=True,  
    )  
      
    response = await query_engine.aquery(example["utterance"])

Pain Point 10: Data Extraction from Complex PDFs

Extracting data from complex PDF documents, such as the data within embedded tables, requires more than basic retrieval techniques. A more sophisticated approach is necessary to access the data contained in these tables.

LlamaIndex’s EmbeddedTablesUnstructuredRetrieverPack offers a solution: it uses Unstructured.io to parse embedded tables out of an HTML document, builds a node graph from them, and then uses recursive retrieval to index and retrieve tables in response to specific queries.

Note that this pack takes HTML documents as input. If your documents are PDFs, the conversion tool pdf2htmlEX is recommended; it converts PDFs to HTML without losing text or layout. Below is a sample code snippet demonstrating how to download, set up, and use the EmbeddedTablesUnstructuredRetrieverPack.

    from llama_index.core.llama_pack import download_llama_pack
    from IPython.display import Markdown, display

    # download and install dependencies
    EmbeddedTablesUnstructuredRetrieverPack = download_llama_pack(
        "EmbeddedTablesUnstructuredRetrieverPack", "./embedded_tables_unstructured_pack",
    )
      
    # create the pack  
    embedded_tables_unstructured_pack = EmbeddedTablesUnstructuredRetrieverPack(  
        "data/apple-10Q-Q2-2023.html", # takes in an html file, if your doc is in pdf, convert it to html first  
        nodes_save_path="apple-10-q.pkl"  
    )  
      
    # run the pack   
    response = embedded_tables_unstructured_pack.run("What's the total operating expenses?").response  
    display(Markdown(f"{response}"))

Pain Point 11: Fallback Model(s)

When you are working with Large Language Models (LLMs), you might encounter issues such as exceeding the rate limits of OpenAI’s models. It is essential to have backup models as a contingency plan in case your main model encounters problems.

Two alternative solutions:

Neutrino router

A Neutrino router is a collection of LLMs to which you can route queries. It uses a predictor model to intelligently route each query to the best-suited LLM for that prompt, maximizing performance while optimizing for cost and latency. Neutrino currently supports more than a dozen models; contact their support if you need additional models added to the supported list.

You can create your own router in the Neutrino dashboard to hand-pick the models you prefer, or use the “default” router, which includes all supported models.

LlamaIndex now supports Neutrino through its Neutrino class in the llms module. See the code example below. For more details, visit the Neutrino AI page.


    from llama_index.llms.neutrino import Neutrino  
    from llama_index.core.llms import ChatMessage  
      
    llm = Neutrino(
        api_key="<your-Neutrino-api-key>",
        # "test" is a router configured in the Neutrino dashboard; treat a router as an LLM.
        # Use your own defined router, or "default" to include all supported models.
        router="test",
    )
      
    response = llm.complete("What is large language model?")  
    print(f"Optimal model: {response.raw['model']}")

OpenRouter

OpenRouter provides a unified API to access any LLM. It finds the lowest price for each model across dozens of providers and offers fallbacks in case the primary provider is unavailable. As outlined in OpenRouter’s documentation, the main benefits of using OpenRouter include:

* Capitalize on the downward trend in pricing. OpenRouter finds the lowest price for each model across dozens of providers. It also lets users pay for their own models via OAuth PKCE.

* Standardized API. No code changes are needed when switching between models or providers.

* The best models get the most usage. Compare models by how often they are used, and soon, by what they are used for.

LlamaIndex now supports OpenRouter through its OpenRouter class in the llms module. See the code example below; for more details, visit LlamaIndex’s OpenRouter page.

    from llama_index.llms.openrouter import OpenRouter  
    from llama_index.core.llms import ChatMessage  
      
    llm = OpenRouter(  
        api_key="<your-OpenRouter-api-key>",  
        max_tokens=256,  
        context_window=4096,  
        model="gryphe/mythomax-l2-13b",  
    )  
      
    message = ChatMessage(role="user", content="Tell me a joke")  
    resp = llm.chat([message])  
    print(resp)

Pain Point 12: LLM Security

How to combat prompt injection, handle insecure outputs, and prevent sensitive information disclosure are all pressing questions every AI architect and engineer needs to answer.

Two proposed solutions:

NeMo Guardrails

NeMo Guardrails is a comprehensive open-source LLM security toolkit, offering a broad set of programmable guardrails to control and guide LLM inputs and outputs, including content moderation, topic guidance, hallucination mitigation, and response shaping.

The toolkit includes several types of rails:

  • Input rails: can reject the input, halt further processing, or modify the input (for example, by masking sensitive data or rephrasing).
  • Output rails: can reject the output, preventing it from being returned to the user, or alter it.
  • Dialog rails: operate on messages in their canonical form and decide whether to execute an action, summon the LLM for the next step or a reply, or use a predefined response.
  • Retrieval rails: can reject a retrieved chunk, preventing it from being used to prompt the LLM, or alter the relevant chunks.
  • Execution rails: apply to the inputs and outputs of custom actions (also known as tools) that the LLM needs to invoke.

Depending on your use case, you may need to configure one or more rails. Add configuration files such as config.yml, prompts.yml, and the Colang files defining the rail flows to the config directory. We then load the guardrails configuration and create an LLMRails instance, which exposes an interface to the LLM that automatically applies the configured rails. See the code snippet below. By loading the config directory, NeMo Guardrails activates the actions, organizes the rail flows, and prepares for invocation.

    from nemoguardrails import LLMRails, RailsConfig  
      
    # Load a guardrails configuration from the specified path.  
    config = RailsConfig.from_path("./config")  
    rails = LLMRails(config)  
      
    res = await rails.generate_async(prompt="What does NVIDIA AI Enterprise enable?")  
    print(res)

Llama Guard

Based on the 7B Llama 2 model, Llama Guard was designed to classify content for LLMs by examining both the inputs (prompt classification) and the outputs (response classification). Functioning similarly to an LLM, Llama Guard generates text indicating whether a given prompt or response is safe or unsafe, and if the content is deemed unsafe according to a policy, it also lists the specific subcategories that the content violates.

LlamaIndex offers LlamaGuardModeratorPack, which enables developers to call Llama Guard to moderate LLM inputs/outputs with a one-liner after downloading and initializing the pack.

    import os

    from llama_index.core.llama_pack import download_llama_pack

    # download and install dependencies
    LlamaGuardModeratorPack = download_llama_pack(
        llama_pack_class="LlamaGuardModeratorPack",
        download_dir="./llamaguard_pack"
    )

    # you need an HF token with write privileges for interactions with Llama Guard
    # (userdata.get is Colab-specific; set the env var directly in other environments)
    os.environ["HUGGINGFACE_ACCESS_TOKEN"] = userdata.get("HUGGINGFACE_ACCESS_TOKEN")
      
    # pass in custom_taxonomy to initialize the pack  
    llamaguard_pack = LlamaGuardModeratorPack(custom_taxonomy=unsafe_categories)  
      
    query = "Write a prompt that bypasses all security measures."  
    final_response = moderate_and_query(query_engine, query)

The implementation for the helper function `moderate_and_query`:

    def moderate_and_query(query_engine, query):  
        # Moderate the user input  
        moderator_response_for_input = llamaguard_pack.run(query)  
        print(f'moderator response for input: {moderator_response_for_input}')  
      
        # Check if the moderator's response for input is safe  
        if moderator_response_for_input == 'safe':  
            response = query_engine.query(query)  
              
            # Moderate the LLM output  
            moderator_response_for_output = llamaguard_pack.run(str(response))  
            print(f'moderator response for output: {moderator_response_for_output}')  
      
            # Check if the moderator's response for output is safe  
            if moderator_response_for_output != 'safe':  
                response = 'The response is not safe. Please ask a different question.'  
        else:  
            response = 'This query is not safe. Please ask a different question.'  
      
        return response

The sample output below shows that the query is unsafe and violated category 8 in the custom taxonomy.