A Guide on 12 Tuning Strategies for Production RAG Systems


This article covers the following “hyperparameters” sorted by their relevant stage. In the ingestion stage of a RAG pipeline, you can achieve performance improvements by:

  • Data cleaning
  • Chunking
  • Embedding models
  • Metadata
  • Multi-indexing
  • Indexing algorithms

And in the inferencing stage (retrieval and generation), you can tune:

  • Query transformations
  • Retrieval parameters
  • Advanced retrieval strategies
  • Re-ranking models
  • LLMs
  • Prompt engineering

Note that this article covers text-based use cases of RAG. For multimodal RAG applications, different considerations may apply.

Ingestion Stage

The ingestion stage is a preparation step for building a RAG pipeline, similar to the data cleaning and preprocessing steps in an ML pipeline. Usually, the ingestion stage consists of the following steps (a minimal code sketch follows below):

  1. Collect data
  2. Chunk data
  3. Generate vector embeddings of chunks
  4. Store vector embeddings and chunks in a vector database

Ingestion stage of a RAG pipeline
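
To make these four steps concrete, here is a deliberately minimal, framework-free sketch of the ingestion stage. The embedding function is a stand-in for a real embedding model, and a plain Python list stands in for the vector database; all names and values are illustrative only.

    import numpy as np

    # 1. Collect data (toy documents).
    documents = [
        "RAG pipelines retrieve external context before generating an answer.",
        "The ingestion stage prepares the external knowledge source.",
    ]

    # 2. Chunk data: naive fixed-size character chunks, just for illustration.
    def chunk(text: str, size: int = 40) -> list[str]:
        return [text[i:i + size] for i in range(0, len(text), size)]

    # 3. Generate vector embeddings: stand-in function (replace with a real model).
    def embed(text: str) -> np.ndarray:
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.random(384)

    # 4. Store embeddings and chunks in a "vector database" (here: a plain list).
    vector_db = [(embed(c), c) for doc in documents for c in chunk(doc)]
    print(len(vector_db), "chunks stored")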

This section discusses impactful techniques and hyperparameters that you can apply and tune to improve the relevance of the retrieved contexts in the inferencing stage.

Data cleaning

Like in any Data Science workflow, the quality of your data heavily influences the outcome of your RAG pipeline [8, 9]. Before moving on to any of the following steps, make sure your data meets the following criteria:

  • Clean: Apply at least the basic data cleaning techniques commonly used in Natural Language Processing, such as making sure all special characters are encoded correctly.
  • Correct: Make sure your information is consistent and factually accurate to avoid the LLM being misled by conflicting information.

Chunking

Chunking your documents is an essential preparation step for your external knowledge source in a RAG pipeline that can affect its performance [1, 8, 9]. It is a technique to generate logically coherent snippets of information, usually by breaking up long documents into smaller sections (but it can also combine smaller snippets into coherent paragraphs).

One choice you need to make is the chunking technique. For example, in LangChain, different text splitters split up documents by different logic, such as by characters, tokens, etc. Which one to use depends on the type of data you have: you will need different chunking techniques if your input is code rather than a Markdown document, for instance.

The ideal length of your chunks (chunk_size) depends on your use case: for question answering, shorter and more specific chunks tend to work better, whereas for summarization, longer chunks are beneficial. Additionally, if a chunk is too short, it might not contain enough context; if it is too long, it might contain too much irrelevant information.

Furthermore, you should think about a “sliding window” between chunks (overlap) to provide some additional context.
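
As an illustration, here is a minimal sketch of how these two hyperparameters could be set with LangChain’s RecursiveCharacterTextSplitter. The file name and the concrete values are placeholders, not recommendations.

    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # chunk_size and overlap are the two hyperparameters discussed above.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,     # maximum characters per chunk
        chunk_overlap=50,   # "sliding window" shared between neighboring chunks
    )

    with open("my_document.txt") as f:  # hypothetical input file
        document = f.read()

    chunks = text_splitter.split_text(document)
    print(f"Created {len(chunks)} chunks")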

Embedding models

At the core of your retrieval are the embedding models. The quality of your embeddings heavily impacts your retrieval results [1, 4]. Usually, higher-dimensional embeddings can capture more nuance, but higher dimensionality alone does not guarantee better retrieval and comes with higher storage and search costs.

To explore which other embedding models are available, you can look at the Massive Text Embedding Benchmark (MTEB) Leaderboard, which listed 164 text embedding models at the time of writing.

MTEB Leaderboard – A Hugging Face Space by mteb (huggingface.co)
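
As a minimal sketch, assuming the sentence-transformers library and a model picked from the leaderboard (the model name below is only an example), generating embeddings for your chunks could look like this:

    from sentence_transformers import SentenceTransformer

    # Example model; pick one from the MTEB leaderboard that fits your domain,
    # latency, and storage budget.
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    chunks = [
        "RAG combines retrieval with generation.",
        "Chunk size is a tunable hyperparameter.",
    ]

    embeddings = model.encode(chunks)
    print(embeddings.shape)  # (number of chunks, embedding dimensionality), e.g. (2, 384)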

While you can use general-purpose embedding models out of the box, in some cases it can make sense to fine-tune the embedding model on your specific use case to avoid out-of-domain issues later on [9]. According to experiments conducted by LlamaIndex, fine-tuning an embedding model can lead to a 5–10% increase in retrieval evaluation metrics [2].

However, not every embedding model can be fine-tuned (for instance, OpenAI’s text-embedding-ada-002 cannot be fine-tuned at the moment).

Metadata

When you store vector embeddings in a vector database, some vector databases let you store metadata (i.e., non-vectorized data) alongside them. Annotating vector embeddings with metadata can help with additional post-processing of the search results, such as metadata filtering [1, 3, 8, 9]. For example, you could append metadata such as the date, the chapter, or the subchapter reference.
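
To make this concrete, here is a small sketch using Chroma as an example vector database; the collection name, documents, and metadata values are made up, and other vector databases offer similar metadata filters.

    import chromadb

    client = chromadb.Client()  # in-memory client, for illustration only
    collection = client.create_collection("docs")

    # Store chunks together with metadata ("chapter" is a made-up field).
    collection.add(
        ids=["chunk-1", "chunk-2"],
        documents=[
            "Chunking splits long documents into smaller pieces.",
            "Re-ranking models reorder retrieved contexts.",
        ],
        metadatas=[{"chapter": "ingestion"}, {"chapter": "inference"}],
    )

    # Retrieve only from chunks tagged with a specific chapter.
    results = collection.query(
        query_texts=["How do I split my documents?"],
        n_results=1,
        where={"chapter": "ingestion"},  # metadata filter
    )
    print(results["documents"])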

Multi-indexing

If the metadata is not sufficient to provide additional information to separate different types of context logically, you may want to experiment with multiple indexes. For example, you can use different indexes for different types of documents. Note that you will then need some form of index routing at retrieval time. If you are interested in a deeper dive into metadata and separate collections, you might want to learn more about the concept of native multi-tenancy.
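
As a minimal illustration of index routing, continuing the Chroma sketch from above: the collection names are made up, and the keyword-based router is placeholder logic; in practice, the routing decision is often made by an LLM or a classifier.

    import chromadb

    client = chromadb.Client()

    # Two separate indexes (collections) for different document types.
    code_index = client.create_collection("code_docs")      # hypothetical collection
    policy_index = client.create_collection("policy_docs")  # hypothetical collection

    code_index.add(ids=["c1"], documents=["The search API is called via GET /v1/search."])
    policy_index.add(ids=["p1"], documents=["Refunds are possible within 30 days."])

    def route_query(query: str):
        """Very naive keyword-based index routing (placeholder logic only)."""
        technical_terms = ("function", "error", "api", "stack trace")
        if any(term in query.lower() for term in technical_terms):
            return code_index
        return policy_index

    index = route_query("How do I call the search API?")
    results = index.query(query_texts=["How do I call the search API?"], n_results=1)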

Indexing algorithms

To facilitate ultra-fast similarity searches on a large scale, vector databases and vector indexing libraries opt for an Approximate Nearest Neighbor (ANN) search over a k-nearest neighbor (kNN) search. By design, ANN algorithms provide an approximation of the nearest neighbors, which may not be as accurate as those identified by a kNN algorithm.

Various ANN algorithms are available for experimentation, such as Facebook’s Faiss (clustering), Spotify’s Annoy (trees), Google’s ScaNN (vector compression), and HNSWLIB (proximity graphs). Moreover, several of these ANN algorithms expose tunable parameters, such as efConstruction and maxConnections for HNSW [1].
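
For illustration, here is a small sketch of these kinds of parameters using the hnswlib library; the dimensionality and parameter values are arbitrary examples, not tuned recommendations.

    import hnswlib
    import numpy as np

    dim = 128
    data = np.random.rand(1000, dim).astype(np.float32)  # toy embeddings

    # ef_construction and M (max connections per node) trade off build time
    # and memory against recall.
    index = hnswlib.Index(space="cosine", dim=dim)
    index.init_index(max_elements=1000, ef_construction=200, M=16)
    index.add_items(data)

    # ef controls the accuracy/speed trade-off at query time.
    index.set_ef(50)
    labels, distances = index.knn_query(data[:1], k=5)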

In addition, vector compression can be activated for these indexing algorithms. Similar to ANN algorithms, vector compression introduces a loss in precision. Nevertheless, the impact on precision can be mitigated by selecting an appropriate vector compression algorithm and fine-tuning its parameters.

Usually, these parameters are already tuned by the research teams of vector databases and vector indexing libraries as part of their benchmarking experiments, rather than by developers of RAG systems. However, if you want to experiment with them to squeeze out the last bits of performance, the following article is recommended as a starting point:

Inferencing Stage (Retrieval & Generation)

The main components of the RAG pipeline are the retrieval and the generative components. This section mainly discusses strategies to improve the retrieval (Query transformations, retrieval parameters, advanced retrieval strategies, and re-ranking models) as this is the more impactful component of the two. But it also briefly touches on some strategies to improve the generation (LLM and prompt engineering).

Inference stage of a RAG pipeline

Query transformations

Since the search query to retrieve additional context in a RAG pipeline is also embedded into the vector space, its phrasing can also impact the search results. Thus, if your search query doesn’t result in satisfactory search results, you can experiment with various query transformation techniques [5, 8, 9], such as:

  • Rephrasing: Use an LLM to rephrase the query and try again.
  • Hypothetical Document Embeddings (HyDE): Use an LLM to generate a hypothetical response to the search query and use both for retrieval (a rough sketch follows this list).
  • Sub-queries: Break down longer queries into multiple shorter queries.
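
As an example, here is a rough sketch of the HyDE idea using the OpenAI Python client; the model names and the prompt wording are assumptions, and any LLM and embedding model can be substituted.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    query = "What is the refund policy for damaged items?"

    # 1. Let an LLM write a hypothetical answer to the query.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[{
            "role": "user",
            "content": f"Write a short passage that answers: {query}",
        }],
    )
    hypothetical_doc = completion.choices[0].message.content

    # 2. Embed both the query and the hypothetical answer and use them
    #    (e.g., averaged) to search the vector database.
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=[query, hypothetical_doc],
    )
    query_vector = [
        (q + h) / 2
        for q, h in zip(response.data[0].embedding, response.data[1].embedding)
    ]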

Retrieval parameters

The retrieval is an essential component of the RAG pipeline. The first consideration is whether semantic search will be sufficient for your use case or if you want to experiment with hybrid search.

In the latter case, you need to experiment with weighting the aggregation of sparse and dense retrieval methods in hybrid search [1, 4, 9].

Improving Retrieval Performance in RAG Pipelines with Hybrid Search
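
For intuition, here is a bare-bones sketch of one common way to combine sparse and dense scores with a single weight (here called alpha). The scores and documents are made up, and real hybrid search implementations handle score normalization and fusion for you.

    # Toy scores for three documents from two retrieval methods (made-up numbers).
    dense_scores = {"doc1": 0.82, "doc2": 0.75, "doc3": 0.40}   # semantic search
    sparse_scores = {"doc1": 0.10, "doc2": 0.90, "doc3": 0.55}  # keyword/BM25-style

    alpha = 0.5  # 1.0 = pure dense (semantic), 0.0 = pure sparse (keyword)

    hybrid_scores = {
        doc: alpha * dense_scores[doc] + (1 - alpha) * sparse_scores[doc]
        for doc in dense_scores
    }
    ranked = sorted(hybrid_scores, key=hybrid_scores.get, reverse=True)
    print(ranked)  # order of the retrieved documents under this weighting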

Also, the number of search results to retrieve will play an essential role. The number of retrieved contexts will impact the length of the used context window (see Prompt Engineering). Also, if you are using a re-ranking model, you need to consider how many contexts to input to the model (see Re-ranking models).

Note that while the similarity measure used for semantic search is a parameter you can change, you should not experiment with it freely; instead, set it according to the embedding model you use (e.g., text-embedding-ada-002 supports cosine similarity, while multi-qa-MiniLM-l6-cos-v1 supports cosine similarity, dot product, and Euclidean distance).

Advanced retrieval strategies

This section could technically be its own article. For this overview, we will keep this as concise as possible. For an in-depth explanation of the following techniques, I recommend this DeepLearning.AI course:

Building and Evaluating Advanced RAG Applications – a DeepLearning.AI course covering methods like sentence-window retrieval and auto-merging retrieval (www.deeplearning.ai)

The underlying idea of this section is that the chunks used for retrieval don’t necessarily have to be the same chunks used for generation. Ideally, you would embed smaller chunks for retrieval (see Chunking) but retrieve bigger contexts for generation [7].

  • Sentence-window retrieval: Retrieve not only the relevant sentence, but also a window of sentences before and after it (a minimal sketch follows this list).
  • Auto-merging retrieval: The documents are organized in a tree-like structure. At query time, separate but related smaller chunks can be consolidated into a larger context.
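
To illustrate the sentence-window idea without any particular framework (frameworks such as LlamaIndex ship ready-made implementations), here is a minimal sketch: retrieval finds the single most relevant sentence, and generation receives it together with its neighbors. The sentences and the matched index are made up.

    sentences = [
        "RAG pipelines have an ingestion and an inference stage.",
        "Chunking splits documents into smaller pieces.",
        "Smaller chunks tend to retrieve more precisely.",
        "Larger contexts help the LLM generate better answers.",
    ]

    def sentence_window(hit_index: int, window: int = 1) -> str:
        """Return the retrieved sentence plus `window` neighbors on each side."""
        start = max(0, hit_index - window)
        end = min(len(sentences), hit_index + window + 1)
        return " ".join(sentences[start:end])

    # Suppose retrieval matched sentence 2 ("Smaller chunks tend to ...").
    context_for_llm = sentence_window(hit_index=2, window=1)
    print(context_for_llm)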

Re-ranking models

While semantic search retrieves context based on its semantic similarity to the search query, “most similar” doesn’t necessarily mean “most relevant”. Re-ranking models, such as Cohere’s Rerank model, can help eliminate irrelevant search results by computing a score for the relevance of each retrieved context to the query [1, 9].


If you are using a re-ranker model, you may need to re-tune the number of search results for the input of the re-ranker and how many of the reranked results you want to feed into the LLM.
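
As a sketch of what this can look like in practice with Cohere’s Rerank endpoint: the model name and the document snippets are examples, and the response attribute names may vary slightly between SDK versions, so check Cohere’s documentation.

    import cohere

    co = cohere.Client("YOUR_API_KEY")  # placeholder API key

    query = "How do I tune chunk size?"
    documents = [
        "Chunk size is a hyperparameter of the ingestion stage.",
        "Re-ranking models reorder retrieved contexts by relevance.",
        "The MTEB leaderboard compares text embedding models.",
    ]

    # Re-rank the retrieved contexts and keep only the top ones for the LLM.
    reranked = co.rerank(
        model="rerank-english-v3.0",  # example model name
        query=query,
        documents=documents,
        top_n=2,
    )
    for result in reranked.results:
        print(result.index, result.relevance_score)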

As with the embedding models, you may want to experiment with fine-tuning the re-ranker to your specific use case.

LLMs

The LLM is the core component for generating the response. Similarly to the embedding models, there is a wide range of LLMs you can choose from depending on your requirements, such as open vs. proprietary models, inferencing costs, context length, etc. [1]

As with the embedding models or re-ranking models, you may want to experiment with fine-tuning the LLM to your specific use case to incorporate specific wording or tone of voice.

Prompt engineering

How you phrase or engineer your prompt will significantly impact the LLM’s completion [1, 8, 9]. For example, the instructions below explicitly tell the LLM to ground its answer in the retrieved search results:


    Please base your answer only on the search results and nothing else!

    Very important! Your answer MUST be grounded in the search results provided.   
    Please explain why your answer is grounded in the search results!

Additionally, using few-shot examples in your prompt can improve the quality of the completions.

As mentioned in Retrieval parameters, the number of contexts fed into the prompt is a parameter you should experiment with [1]. While the performance of your RAG pipeline can improve with increasing relevant context, you can also run into a “Lost in the Middle” [6] effect where relevant context is not recognized as such by the LLM if it is placed in the middle of many contexts.
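
As a closing illustration, here is a small sketch of assembling the final prompt from retrieved contexts. Reordering so that the highest-ranked contexts sit at the beginning and the end of the context block is one simple hedge against the “Lost in the Middle” effect; the template wording and contexts are just examples.

    retrieved_contexts = [
        "context A (rank 1)", "context B (rank 2)",
        "context C (rank 3)", "context D (rank 4)",
    ]

    # Place the top-ranked contexts at the edges and the weaker ones in the middle,
    # since LLMs tend to pay less attention to the middle of long prompts [6].
    reordered = retrieved_contexts[0::2] + retrieved_contexts[1::2][::-1]

    prompt = (
        "Please base your answer only on the search results and nothing else!\n\n"
        "Search results:\n" + "\n".join(reordered) + "\n\n"
        "Question: How do I tune my RAG pipeline?"
    )
    print(prompt)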