Step 5: Identify the root cause of quality issues#

[Figure: workflow_iterate.png]

Expected time: 60 minutes

Requirements#

  1. Your evaluation results for the POC are available in MLflow

    • If you followed the previous step, this will be the case!

  2. All requirements from previous steps

Code Repository

You can find all of the sample code referenced throughout this section here.

Overview#

Root causes of quality issues fall into two primary buckets: retrieval and generation. To decide which to focus on first, use the output of the Mosaic AI Agent Evaluation LLM judges that you ran in the previous step to identify the most frequent root cause affecting your app's quality.

Each row of your evaluation set will be tagged as follows:

  1. Overall assessment: pass or fail

  2. Root cause: Improve Retrieval or Improve Generation

  3. Root cause rationale: A brief description of why the root cause was selected
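Once each row carries these tags, finding the most frequent root cause is a simple count over the failing rows. The sketch below is a minimal illustration using only the standard library; the `rows` list and its `overall` and `root_cause` keys are hypothetical stand-ins, not the actual column names produced by the notebook.

```python
from collections import Counter

# Hypothetical per-row tags; in practice these come from your evaluation output.
rows = [
    {"overall": "fail", "root_cause": "Improve Retrieval"},
    {"overall": "fail", "root_cause": "Improve Generation"},
    {"overall": "fail", "root_cause": "Improve Retrieval"},
    {"overall": "pass", "root_cause": None},
]


def most_frequent_root_cause(rows):
    """Return (root_cause, count) for the most common cause among failing rows."""
    counts = Counter(
        r["root_cause"]
        for r in rows
        if r["overall"] == "fail" and r["root_cause"]
    )
    # most_common(1) returns a one-element list, or an empty list if no failures.
    return counts.most_common(1)[0] if counts else None


print(most_frequent_root_cause(rows))
```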

Instructions#

The approach depends on whether your evaluation set contains ground-truth responses to your questions, stored in expected_response. If expected_response is available, use the first table below. Otherwise, use the second table.
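As a rough illustration of that choice, the check below decides which table applies. It assumes the evaluation set is available as a list of dictionaries (a hypothetical representation); only the `expected_response` field name comes from the text above.

```python
def has_ground_truth(eval_rows):
    """True only if every row has a non-empty expected_response."""
    return bool(eval_rows) and all(r.get("expected_response") for r in eval_rows)


def table_to_use(eval_rows):
    """Name the root cause table that applies to this evaluation set."""
    if has_ground_truth(eval_rows):
        return "with available ground truth"
    return "without available ground truth"


# Hypothetical evaluation rows, for illustration only.
labeled = [{"request": "What is RAG?", "expected_response": "Retrieval-augmented generation."}]
unlabeled = [{"request": "What is RAG?"}]

print(table_to_use(labeled))
print(table_to_use(unlabeled))
```

This sketch treats a partially labeled set as having no ground truth; you may prefer to split such a set and analyze the two parts separately.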

  1. Open the 05_evaluate_poc_quality Notebook

  2. Run the cells that are relevant to your use case, i.e., depending on whether or not you have expected_response

  3. Review the output tables to determine the most frequent root cause in your application

  4. For each root cause, follow the steps below to further debug and identify potential fixes:

Root cause analysis with available ground truth#

Note

If you have human-labeled ground truth for which document should be retrieved for each question, you can optionally use the score for retrieval/ground_truth/document_recall/average in place of retrieval/llm_judged/chunk_relevance/precision/average.
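The substitution described in this note can be sketched as a small helper that prefers the ground-truth metric when it is present. The two metric names are taken from the note; the `metrics` dictionary and its values are made up for illustration and are not real results.

```python
LLM_JUDGED = "retrieval/llm_judged/chunk_relevance/precision/average"
GROUND_TRUTH = "retrieval/ground_truth/document_recall/average"


def retrieval_score(metrics):
    """Return (metric_name, value), preferring ground-truth document recall."""
    if metrics.get(GROUND_TRUTH) is not None:
        return GROUND_TRUTH, metrics[GROUND_TRUTH]
    return LLM_JUDGED, metrics[LLM_JUDGED]


# Hypothetical aggregate metrics, for illustration only.
with_labels = {LLM_JUDGED: 0.40, GROUND_TRUTH: 0.75}
without_labels = {LLM_JUDGED: 0.40}

print(retrieval_score(with_labels))
print(retrieval_score(without_labels))
```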

| Chunk relevance precision | Groundedness | Correctness | Relevance to query | Issue summary | Root cause | Overall rating |
|---|---|---|---|---|---|---|
| <50% | ❌ | ❌ | ❌ | Retrieval is poor. | Improve Retrieval | fail |
| <50% | ❌ | ❌ | ✅ | LLM generates a relevant response, but retrieval is poor, e.g., the LLM ignores retrieval and uses its training knowledge to answer. | Improve Retrieval | fail |
| <50% | ❌ | ✅ | ✅ or ❌ | Retrieval quality is poor, but the LLM gets the answer correct regardless. | Improve Retrieval | fail |
| <50% | ✅ | ❌ | ❌ | Response is grounded in retrieval, but retrieval is poor. | Improve Retrieval | fail |
| <50% | ✅ | ❌ | ✅ | Relevant response grounded in the retrieved context, but retrieval may not be related to the expected answer. | Improve Retrieval | fail |
| <50% | ✅ | ✅ | ✅ or ❌ | Retrieval finds enough information for the LLM to correctly answer. 🎉 | N/A | pass |
| >50% | ❌ | ❌ | ✅ or ❌ | Hallucination. | Improve Generation | fail |
| >50% | ❌ | ✅ | ✅ or ❌ | Hallucination: correct, but generates details not in the context. | Improve Generation | fail |
| >50% | ✅ | ❌ | ❌ | Good retrieval, but the LLM does not provide a relevant response. | Improve Generation | fail |
| >50% | ✅ | ❌ | ✅ | Good retrieval and relevant response, but not correct. | Improve Generation | fail |
| >50% | ✅ | ✅ | ✅ | No issue! 🎉 | N/A | pass |
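The decision logic of this table can be condensed into a few lines. What follows is a sketch of one reading of the table, not the notebook's actual implementation: the function name and argument names are hypothetical, and two cases the table does not list (exactly 50% precision, and good retrieval with a grounded, correct, but irrelevant response) are handled by assumptions marked in the comments.

```python
def root_cause_with_ground_truth(chunk_precision, grounded, correct, relevant):
    """Return (root_cause, overall_rating) for one evaluation row.

    chunk_precision is a fraction in [0, 1]; the other arguments are booleans
    for the groundedness, correctness, and relevance-to-query judges.
    """
    # Assumption: exactly 50% counts as poor retrieval (the table lists only
    # <50% and >50%).
    if chunk_precision <= 0.5:
        # With poor retrieval, only a grounded and correct response passes.
        if grounded and correct:
            return None, "pass"
        return "Improve Retrieval", "fail"
    if grounded and correct and relevant:
        return None, "pass"
    # Assumption: grounded and correct but not relevant is not in the table;
    # it is treated here as a generation issue.
    return "Improve Generation", "fail"


print(root_cause_with_ground_truth(0.3, grounded=True, correct=False, relevant=True))
print(root_cause_with_ground_truth(0.8, grounded=True, correct=True, relevant=True))
```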


Root cause analysis without available ground truth#

| Chunk relevance precision | Groundedness | Relevance to query | Issue summary | Root cause | Overall rating |
|---|---|---|---|---|---|
| <50% | ❌ | ❌ | Retrieval quality is poor. | Improve Retrieval | fail |
| <50% | ❌ | ✅ | Retrieval quality is poor. | Improve Retrieval | fail |
| <50% | ✅ | ❌ | Response is grounded in retrieval, but retrieval is poor. | Improve Retrieval | fail |
| <50% | ✅ | ✅ | Relevant response grounded in the retrieved context, but retrieval is poor. | Improve Retrieval | pass |
| >50% | ❌ | ❌ | Hallucination. | Improve Generation | fail |
| >50% | ❌ | ✅ | Hallucination. | Improve Generation | fail |
| >50% | ✅ | ❌ | Good retrieval and grounded, but the LLM does not provide a relevant response. | Improve Generation | fail |
| >50% | ✅ | ✅ | Good retrieval and relevant response. Collect ground truth to know if the answer is correct. | None | pass |
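Without ground truth, the same kind of rule needs only three inputs. The sketch below is again one reading of the table rather than the notebook's code; the function name is hypothetical, and treating exactly 50% precision as poor retrieval is an assumption, since the table lists only <50% and >50%.

```python
def root_cause_without_ground_truth(chunk_precision, grounded, relevant):
    """Return (root_cause, overall_rating) for one evaluation row.

    chunk_precision is a fraction in [0, 1]; grounded and relevant are
    booleans for the groundedness and relevance-to-query judges.
    """
    passes = grounded and relevant
    # Assumption: exactly 50% counts as poor retrieval.
    if chunk_precision <= 0.5:
        # Poor retrieval is always flagged, even when the response itself passes.
        return "Improve Retrieval", "pass" if passes else "fail"
    if passes:
        return None, "pass"
    return "Improve Generation", "fail"


print(root_cause_without_ground_truth(0.3, grounded=True, relevant=True))
print(root_cause_without_ground_truth(0.8, grounded=False, relevant=True))
```

Note that a row can pass overall and still carry the Improve Retrieval tag, which mirrors the fourth row of the table.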