LeapfrogAI Evaluations v1.1

Description

Now that a baseline evaluations framework for LeapfrogAI exists, it needs to be further expanded to meet the needs of the product and mission-success teams.
Feedback has been provided; the common themes that need to be addressed are:
- Some evaluations (primarily NIAH) always pass at 100% and, as such, are not helpful for tracking growth over time.
- Some NIAH and QA evals do not leverage the full chunk data in RAG responses and, as such, do not evaluate RAG to the extent they should.
- Evaluation results are not currently stored anywhere.
- The current implementation of LFAI evals is specific to the OpenAI way of handling RAG, so the evaluations can't be run against custom RAG pipelines (a delivery concern); see the sketch after this list.
- MMLU results sometimes suspiciously return the same score for multiple topics, indicating a potential problem with the evaluation 🐛
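One way to address the RAG-coupling concern above could be a thin pipeline interface that the evals target instead of calling the OpenAI Assistants flow directly. The sketch below is illustrative only; `RAGPipeline`, `RAGResult`, and `run_niah_case` are hypothetical names, not existing LeapfrogAI code.

```python
# Hypothetical sketch: a pipeline-agnostic interface the evals could target,
# so they are not tied to the OpenAI-style RAG flow specifically.
# All class and function names here are illustrative, not part of the current codebase.
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class RAGResult:
    """What an evaluation needs back from any RAG pipeline."""
    answer: str                                               # the generated response
    context_chunks: list[str] = field(default_factory=list)   # retrieved chunks used for the answer


class RAGPipeline(Protocol):
    """Any pipeline (OpenAI-backed or a custom retriever + LLM) implements this."""
    def query(self, question: str) -> RAGResult: ...


def run_niah_case(pipeline: RAGPipeline, question: str, needle: str) -> dict:
    """Evaluate a single needle-in-a-haystack case against any pipeline.

    Scores retrieval (was the needle in the returned chunks?) separately
    from the response (did the answer contain the needle?).
    """
    result = pipeline.query(question)
    retrieved = any(needle in chunk for chunk in result.context_chunks)
    responded = needle in result.answer
    return {"retrieval_pass": retrieved, "response_pass": responded}
```

With an interface like this, the existing OpenAI-backed flow becomes just one implementation, and a delivery team's custom RAG pipeline can be evaluated by providing the same `query` method.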
Completion Criteria
- Create a new Needle in a Haystack dataset for more difficult evaluations (see the sketch below).
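For the harder NIAH dataset, one hypothetical approach is to scale the haystack length and vary the needle's depth so retrieval can no longer succeed trivially. The sketch below assumes synthetic filler text and a numeric "secret code" needle; all names, sizes, and wording are placeholders, not part of any existing dataset.

```python
# Hypothetical sketch of building a harder NIAH dataset: longer haystacks and
# needles buried at varying depths, so a well-tuned RAG pipeline no longer
# scores 100% trivially. Filler text, needle wording, and sizes are placeholders.
import random


def build_haystack(needle: str, filler_sentence: str, n_sentences: int, depth: float) -> str:
    """Return a haystack with `needle` inserted `depth` (0.0-1.0) of the way through."""
    sentences = [filler_sentence] * n_sentences
    insert_at = int(depth * n_sentences)
    sentences.insert(insert_at, needle)
    return " ".join(sentences)


def build_dataset(seed: int = 42) -> list[dict]:
    random.seed(seed)
    cases = []
    for n_sentences in (200, 1000, 5000):          # escalating haystack sizes
        for depth in (0.0, 0.25, 0.5, 0.75, 1.0):  # needle position within the haystack
            secret = str(random.randint(10000, 99999))
            needle = f"The secret code is {secret}."
            cases.append({
                "context": build_haystack(needle, "The sky was a deep shade of blue.", n_sentences, depth),
                "question": "What is the secret code?",
                "expected": secret,
            })
    return cases
```

Sweeping both haystack size and needle depth yields a score surface rather than a single pass/fail number, which should make growth (or regressions) visible over time instead of a constant 100%.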