Debugging generation quality

You are on this page because your root cause analysis identified improving LLM generation as a root cause to address.

Even with optimal retrieval, if the LLM component of a RAG chain cannot effectively use the retrieved context to generate accurate, coherent, and relevant responses, the final output quality will suffer. Generation quality issues can manifest as hallucinations, inconsistencies, or failures to concisely address the user's query, among others.

The following is a step-by-step process to address generation quality issues:

  1. Open the 05_evaluate_poc_quality notebook.

  2. Use the queries to load the MLflow traces of the records that have generation quality issues (see the sketch after this list).

  3. For each record, manually examine the generated response and compare it to the retrieved context and the ground-truth response.

  4. Look for patterns or common issues among the queries with low generation quality. Some examples:

    • Generating information not present in the retrieved context, or generating information that contradicts the retrieved context (i.e., hallucination)

    • Failure to directly address the user's query, even given the retrieved context

    • Generating responses that are overly verbose, difficult to understand, or lacking in logical coherence

  5. Based on the identified issues, hypothesize potential root causes and corresponding fixes. See the “Common reasons for poor generation quality” table below for guidance.

  6. Follow the steps in implement and evaluate changes to test a potential fix.

    • This may involve modifying the RAG chain (e.g., adjusting the prompt template, trying a different LLM) or the data pipeline (e.g., adjusting the chunking strategy to provide more context).

  7. If the generation quality is still not satisfactory, repeat steps 5 and 6 for the next most promising fix until the desired performance is achieved.

  8. Re-run the root cause analysis to determine if the overall chain has any additional root causes that should be addressed.
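
For step 2, the following is a minimal sketch of loading traces for manual review with the MLflow client. It assumes MLflow 2.14 or later, where `mlflow.search_traces` returns a pandas DataFrame; the experiment ID is a placeholder, and any `filter_string` depends on what your evaluation run logged:

```python
import mlflow

# Load recent traces for manual inspection. The experiment ID below is a
# placeholder; supply the one used by the 05_evaluate_poc_quality notebook.
traces = mlflow.search_traces(
    experiment_ids=["<your-experiment-id>"],
    max_results=50,
)

# Each row contains the user query ("request") and the generated answer
# ("response"), which you can compare against the retrieved context and
# the ground-truth response.
for _, trace in traces.iterrows():
    print("Query:   ", trace["request"])
    print("Response:", trace["response"])
```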

Common reasons for poor generation quality

Each of these potential fixes can be broadly categorized into three buckets:

  1. Data pipeline changes

  2. Chain config changes

  3. Chain code changes

Based on the type of change, you follow different steps in implement and evaluate changes.

Important

Broadly, Databricks recommends using prompt engineering to iterate on the quality of your app’s outputs. The majority of the steps below involve prompt engineering.
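
For example, a grounding-oriented prompt template might look like the sketch below. The placeholder names ({context}, {question}) and the exact wording are illustrative assumptions, not the template used by the notebooks:

```python
# Illustrative prompt template that emphasizes reliance on retrieved context.
# Placeholder names are assumptions; match them to your chain's prompt variables.
PROMPT_TEMPLATE = """Answer the question using ONLY the context below.
If the context does not contain the answer, say "I don't know" instead of guessing.

Context:
{context}

Question: {question}

Answer concisely, citing only facts found in the context."""

prompt = PROMPT_TEMPLATE.format(
    context="<retrieved chunks>",  # filled in by the chain at runtime
    question="<user query>",
)
```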

Each generation issue below lists its debugging steps and potential fixes, with each fix tagged by its change type (chain-config or chain-code).
Issue: Generating information not present in the retrieved context (e.g., hallucinations)

Debugging steps:
  • Compare generated responses to the retrieved context to identify hallucinated information
  • Assess whether certain types of queries or retrieved context are more prone to hallucinations

Potential fixes:
  • chain-config: Update the prompt template to emphasize reliance on the retrieved context
  • chain-config: Use a more capable LLM
  • chain-code: Implement a fact-checking or verification step post-generation (see the sketch below)
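
A post-generation verification step can use a second LLM call as a judge. The sketch below assumes an OpenAI-compatible client; the model name and judge prompt are illustrative assumptions, not part of the documented chain:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint is configured

def is_grounded(answer: str, context: str, model: str = "gpt-4o-mini") -> bool:
    """Ask a judge model whether every claim in the answer is supported
    by the retrieved context. Model name and prompt are assumptions."""
    judge_prompt = (
        f"Context:\n{context}\n\nAnswer:\n{answer}\n\n"
        "Is every factual claim in the answer supported by the context? "
        "Reply with exactly YES or NO."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```

If the check fails, the chain can regenerate the answer or fall back to quoting the retrieved context directly.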
Issue: Failure to directly address the user's query, or providing overly generic responses

Debugging steps:
  • Compare generated responses to user queries to assess relevance and specificity
  • Check whether certain types of queries result in the correct context being retrieved, but the LLM producing low-quality output

Potential fixes:
  • chain-config: Improve the prompt template to encourage direct, specific responses
  • chain-config: Retrieve more targeted context by improving the retrieval process
  • chain-code: Re-rank retrieval results to put the most relevant chunks first, and provide only these to the LLM (see the sketch below)
  • chain-config: Use a more capable LLM
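
One way to re-rank is with a cross-encoder that scores each query/chunk pair. The sketch below assumes the sentence-transformers package and a public MS MARCO re-ranking model; both choices are assumptions, and any re-ranker with a similar interface works:

```python
from sentence_transformers import CrossEncoder

# Model choice is an assumption; any query/passage relevance model works.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Score each (query, chunk) pair and keep only the top_k most
    relevant chunks to pass to the LLM."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```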
Issue: Generating responses that are difficult to understand or lack logical flow

Debugging steps:
  • Assess the output for logical flow, grammatical correctness, and understandability
  • Analyze whether incoherence occurs more often with certain types of queries or when certain types of context are retrieved

Potential fixes:
  • chain-config: Change the prompt template to encourage coherent, well-structured responses
  • chain-config: Provide more context to the LLM by retrieving additional relevant chunks (see the sketch below)
  • chain-config: Use a more capable LLM
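
Retrieving additional chunks is typically a configuration change rather than a code change. A minimal sketch, assuming a dictionary-style chain configuration; the key names are assumptions and should be matched to your chain's actual config schema:

```python
# Illustrative chain configuration; key names are assumptions.
chain_config = {
    "retriever": {
        # Raise the number of retrieved chunks (e.g., from 3) so the LLM
        # has more supporting context to draw on.
        "num_results": 8,
    },
    "llm": {
        # A lower temperature can also make responses more coherent.
        "temperature": 0.1,
    },
}
```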
Issue: Generated responses are not in the desired format or style

Debugging steps:
  • Compare the output to the expected format and style guidelines
  • Assess whether certain types of queries or retrieved context are more likely to result in format or style deviations

Potential fixes:
  • chain-config: Update the prompt template to specify the desired output format and style
  • chain-code: Implement a post-processing step to convert the generated response into the desired format
  • chain-code: Add a step to validate the output structure and style, and return a fallback answer if needed (see the sketch below)
  • chain-config: Use an LLM fine-tuned to provide outputs in a specific format or style
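
A validation step can be as simple as parsing the expected structure and substituting a fallback answer on failure. The sketch below assumes the chain is expected to emit JSON with an "answer" field; the schema and fallback message are illustrative assumptions:

```python
import json

FALLBACK = "Sorry, I could not produce a well-formed answer. Please try rephrasing your question."

def validate_output(raw_response: str) -> str:
    """Return the answer if the response matches the expected JSON schema
    (an assumption for illustration); otherwise return a fallback answer."""
    try:
        parsed = json.loads(raw_response)
        answer = parsed["answer"]
        if isinstance(answer, str) and answer.strip():
            return answer
    except (json.JSONDecodeError, KeyError, TypeError):
        pass
    return FALLBACK
```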