Debugging retrieval quality#

How to debug retrieval quality#

You are on this page because your root cause analysis identified improving retrieval as the root cause to address.

Retrieval quality is arguably the most important component of a RAG application. If the most relevant chunks are not returned for a given query, the LLM will not have access to the necessary information to generate a high-quality response. Poor retrieval can thus lead to irrelevant, incomplete, or hallucinated output. This step requires manual effort to analyze the underlying data. With Mosaic AI, this becomes considerably easier given the tight integration between the data platform (Unity Catalog and Vector Search) and experiment tracking (MLflow LLM evaluation and MLflow tracing).

Instructions#

Here’s a step-by-step process to address retrieval quality issues:

  1. Open the 05_evaluate_poc_quality Notebook

  2. Use the queries to load the MLflow traces of the records that have retrieval quality issues (see the sketch after this list).

  3. For each record, manually examine the retrieved chunks. If available, compare them to the ground-truth retrieval documents.

  4. Look for patterns or common issues among the queries with low retrieval quality. Some examples might include:

    • Relevant information is missing from the vector database entirely

    • Insufficient number of chunks/documents returned for a retrieval query

    • Chunks are too small and lack sufficient context

    • Chunks are too large and contain multiple, unrelated topics

    • The embedding model fails to capture semantic similarity for domain-specific terms

  5. Based on the identified issue, hypothesize potential root causes and corresponding fixes. See the Common reasons for poor retrieval quality table below for guidance on this.

  6. Follow the steps in implement and evaluate changes to implement and evaluate a potential fix.

    • This may involve modifying the data pipeline (e.g., adjusting chunk size, trying a different embedding model) or modifying the RAG chain (e.g., implementing hybrid search, retrieving more chunks).

  7. If retrieval quality is still not satisfactory, repeat steps 5 and 6 for the next most promising fixes until the desired performance is achieved.

  8. Re-run the root cause analysis to determine if the overall chain has any additional root causes that should be addressed.
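
The following is a minimal sketch of steps 2 and 3 using the MLflow tracing client APIs (MLflow 2.14+). The experiment name is a placeholder, and the way retrieved documents are read off the retriever spans is an assumption about how your traces were logged; adapt it to the queries provided in the 05_evaluate_poc_quality Notebook.

```python
from mlflow import MlflowClient

client = MlflowClient()

# Hypothetical experiment name -- point this at the experiment your
# evaluation run logged its traces to.
experiment = client.get_experiment_by_name("/Shared/rag_poc_evaluation")

# Load traces for manual inspection; add a filter_string to narrow the
# results to the records flagged with retrieval quality issues.
traces = client.search_traces(
    experiment_ids=[experiment.experiment_id],
    max_results=50,
)

for trace in traces:
    # Retriever spans carry the chunks returned for each query.
    retriever_spans = [s for s in trace.data.spans if s.span_type == "RETRIEVER"]
    for span in retriever_spans:
        print("Query:", span.inputs)
        for doc in span.outputs or []:
            # Assumption: retrieved chunks are serialized as dicts with
            # `page_content` and `metadata` (e.g., a source document URI).
            if isinstance(doc, dict):
                print(doc.get("metadata", {}), str(doc.get("page_content", ""))[:200])
```

Compare the printed chunks against the ground-truth documents in your evaluation set to spot the patterns listed in step 4.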

Common reasons for poor retrieval quality#

Each of the potential fixes below can be broadly categorized into three buckets:

  1. Data pipeline changes

  2. Chain config changes

  3. Chain code changes

Based on the type of change, you will follow different steps in the implement and evaluate changes section.

| Retrieval Issue | Debugging Steps | Potential Fix |
| --- | --- | --- |
| Chunks are too small | Examine chunks for incomplete, cut-off information | Data pipeline: increase chunk size and/or overlap<br>Data pipeline: try a different chunking strategy |
| Chunks are too large | Check if retrieved chunks contain multiple, unrelated topics | Data pipeline: decrease chunk size<br>Data pipeline: improve the chunking strategy to avoid mixing unrelated topics (e.g., semantic chunking) |
| Chunks don't have enough information about the text from which they were taken | Assess if the lack of context for each chunk is causing confusion or ambiguity in the retrieved results | Data pipeline: add metadata and titles to each chunk (e.g., section titles)<br>Chain config: retrieve more chunks and use an LLM with a larger context size |
| Embedding model doesn't accurately understand the domain and/or key phrases in user queries | Check if semantically similar chunks are being retrieved for the same query | Data pipeline: try different embedding models<br>Chain config: use hybrid search<br>Chain code: over-fetch retrieval results and re-rank, feeding only the top re-ranked results into the LLM context (see the sketch below)<br>Data pipeline: fine-tune the embedding model on domain-specific data |
| Relevant information is missing from the vector database | Check if any relevant documents or sections are missing from the vector database | Data pipeline: add more relevant documents to the vector database<br>Data pipeline: improve document parsing and metadata extraction |
| Retrieval queries are poorly formulated | If user queries are used directly for semantic search, analyze them for ambiguity or lack of specificity. This can easily happen in multi-turn conversations, where the raw user query references earlier parts of the conversation and is unsuitable to use directly as a retrieval query.<br>Check if query terms match the terminology used in the search corpus | Chain code: add query expansion or transformation (i.e., transform the user query prior to semantic search)<br>Chain code: add query understanding to identify intent and entities (e.g., use an LLM to extract properties for metadata filtering) |
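
As an illustration of the chain-code fixes above, here is a minimal sketch of the over-fetch-and-re-rank approach. It is not part of the Mosaic AI APIs: the cross-encoder model name is one common open-source choice from sentence-transformers, and `overfetched_chunks` / `user_query` are placeholders for whatever your retriever and chain provide.

```python
from sentence_transformers import CrossEncoder

# One common open-source re-ranker; swap in whatever re-ranking model you prefer.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_chunks(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Re-rank over-fetched chunks and keep only the best matches for the query."""
    # The cross-encoder scores each (query, chunk) pair jointly, which often
    # separates relevant chunks from near misses better than embedding distance,
    # especially for domain-specific phrasing.
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]

# Over-fetch from your retriever (e.g., 50 results), then feed only the top
# re-ranked chunks into the LLM context:
# top_chunks = rerank_chunks(user_query, candidates=overfetched_chunks, top_k=5)
```

The same pattern applies if you prefer a hosted re-ranking model; the key design choice is to retrieve more candidates than you intend to use and let a stronger relevance model select the final context.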