Debugging generation quality#
You are on this page because your root cause analysis identified LLM generation as a root cause to address.
Even with optimal retrieval, if the LLM component of a RAG chain cannot effectively use the retrieved context to generate accurate, coherent, and relevant responses, the final output quality will suffer. Generation quality issues can manifest as hallucinations, inconsistencies, or failure to concisely address the user's query, among others.
The following is a step-by-step process to address generation quality issues:
1. Open the `05_evaluate_poc_quality` notebook and use the queries to load the MLflow traces of the records that have generation quality issues (a sketch of this step follows the list).
2. For each record, manually examine the generated response and compare it to the retrieved context and the ground-truth response.
3. Look for patterns or common issues among the queries with low generation quality. Some examples:
   - Generating information that is not present in the retrieved context, or that contradicts the retrieved context (that is, hallucination)
   - Failing to directly address the user's query given the provided retrieved context
   - Generating responses that are overly verbose, difficult to understand, or lacking logical coherence
4. Based on the identified issues, hypothesize potential root causes and corresponding fixes. See the "Common reasons for poor generation quality" table below for guidance.
5. Follow the steps in implement and evaluate changes to implement and evaluate a potential fix. This may involve modifying the RAG chain (for example, adjusting the prompt template or trying a different LLM) or the data pipeline (for example, adjusting the chunking strategy to provide more context).
6. If the generation quality is still not satisfactory, repeat steps 4-5 for the next most promising fix until the desired performance is achieved.
7. Re-run the root cause analysis to determine whether the overall chain has any additional root causes that should be addressed.
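The trace-loading step might look like the following minimal sketch. It assumes MLflow tracing is enabled and that your root cause analysis produced a list of failing request IDs; the experiment ID placeholder and the ID list are illustrative, not prescribed by the notebook.

```python
import mlflow

# Hypothetical request IDs flagged by root cause analysis as having
# generation quality issues (placeholders, not real IDs).
failing_request_ids = ["req-001", "req-017", "req-042"]

# mlflow.search_traces returns a pandas DataFrame with one row per trace;
# the DataFrame typically includes `request_id`, `request`, and `response`
# columns.
traces = mlflow.search_traces(experiment_ids=["<your-experiment-id>"])

# Keep only the traces for the failing records.
failing = traces[traces["request_id"].isin(failing_request_ids)]

# Print each query and generated response for manual comparison against
# the retrieved context and the ground-truth answer.
for _, row in failing.iterrows():
    print("QUERY:   ", row["request"])
    print("RESPONSE:", row["response"])
    print("-" * 80)
```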
Common reasons for poor generation quality#
Each of these potential fixes can be broadly categorized into three buckets:

1. Prompt engineering changes
2. Chain code changes
3. Data pipeline changes
Based on the type of change, you will follow different steps when you implement and evaluate changes.
Important
Broadly, Databricks recommends using prompt engineering to iterate on the quality of your app’s outputs. The majority of the steps below involve prompt engineering.
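For example, a first prompt-engineering iteration for hallucination issues is often a template that instructs the model to rely only on the retrieved context. A minimal sketch, assuming a simple string template (the template text and `build_prompt` helper are illustrative, not part of any Databricks API):

```python
# A grounded prompt template that emphasizes reliance on the retrieved
# context, targeting the hallucination issue in the first table row below.
PROMPT_TEMPLATE = """You are an assistant that answers questions using ONLY the
context provided below. If the context does not contain the answer, say
"I don't have enough information to answer that."

Context:
{context}

Question: {question}

Answer directly and concisely, citing only facts from the context."""


def build_prompt(context: str, question: str) -> str:
    """Fill the template with the retrieved context and the user's query."""
    return PROMPT_TEMPLATE.format(context=context, question=question)
```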
| Generation Issue | Debugging Steps | Potential Fix |
|---|---|---|
| Generating information not present in the retrieved context (e.g., hallucinations) | - Compare the generated response to the retrieved context to identify hallucinated information.<br>- Assess whether certain types of queries or retrieved context are more prone to hallucinations. | - Update the prompt template to emphasize reliance on the retrieved context.<br>- Use a more capable LLM.<br>- Implement a fact-checking or verification step after generation. |
| Failure to directly address the user's query or providing overly generic responses | - Compare the generated response to the user's query to assess relevance and specificity.<br>- Check whether certain types of queries retrieve the correct context but still produce low-quality output. | - Improve the prompt template to encourage direct, specific responses.<br>- Retrieve more targeted context by improving the retrieval process.<br>- Re-rank retrieval results to put the most relevant chunks first, and provide only these to the LLM.<br>- Use a more capable LLM. |
| Generating responses that are difficult to understand or lack logical flow | - Assess the output for logical flow, grammatical correctness, and understandability.<br>- Analyze whether incoherence occurs more often with certain types of queries or retrieved context. | - Change the prompt template to encourage coherent, well-structured responses.<br>- Provide more context to the LLM by retrieving additional relevant chunks.<br>- Use a more capable LLM. |
| Generated responses are not in the desired format or style | - Compare the output to the expected format and style guidelines.<br>- Assess whether certain types of queries or retrieved context are more likely to produce format or style deviations. | - Update the prompt template to specify the desired output format and style.<br>- Implement a post-processing step to convert the generated response into the desired format.<br>- Add a validation step that checks the output format.<br>- Use an LLM fine-tuned to produce structured output. |
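As an illustration of the validation-step fix in the last table row, the following sketch checks that a response parses as JSON with an assumed schema before it is accepted; the function name and the required keys are hypothetical, not part of any Databricks API.

```python
import json
from typing import Optional


def validate_json_response(raw_response: str) -> Optional[dict]:
    """Return the parsed response if it is valid JSON with the expected
    keys; return None so the caller can retry generation or apply a
    post-processing fallback."""
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        return None
    # Hypothetical schema: the chain is prompted to return these keys.
    required_keys = {"answer", "sources"}
    if not isinstance(parsed, dict) or not required_keys.issubset(parsed):
        return None
    return parsed
```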