feat(datasets): add synthetic data generation pipeline #127
Labels
document search
Changes to the document search package
evals
Adding new evaluation pipelines or improving existing ones
feature
New feature or request
Feature description
It'd be great to be able to generate synthetic datasets for each module we want to evaluate. For now, we're focusing only on the document search module, but in the future we also want pipelines for other modules, such as generation (LLM answers based on context) or structured data retrieval (text-to-SQL data), so the interface should make it easy to introduce new pipelines.
For document retrieval, we should scan the corpus and generate a set of questions, pairing each question with carefully selected passages that contain the answer and, optionally, additional context that allows the model to produce a factual answer. Ideally, both parts should be generated by an LLM and then verified.
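A self-contained sketch of that generate-then-verify loop, assuming `llm` is any `str -> str` callable (the function name, prompts, and dict schema are all assumptions for illustration, not a committed design):

```python
import random


def generate_qa_entries(corpus, llm, num_entries, seed=0):
    """Sample passages, have the LLM draft a question/answer pair for each,
    then ask it to verify the pair against the passage before keeping it."""
    rng = random.Random(seed)
    passages = rng.sample(corpus, min(num_entries, len(corpus)))
    entries = []
    for passage in passages:
        question = llm(f"Write one question answerable from this passage:\n{passage}")
        answer = llm(f"Passage:\n{passage}\nQuestion: {question}\nAnswer factually:")
        # Verification step: keep the pair only if the LLM confirms the
        # passage actually supports the generated answer.
        verdict = llm(
            f"Passage:\n{passage}\nQ: {question}\nA: {answer}\n"
            "Does the passage support the answer? yes/no:"
        )
        if verdict.strip().lower().startswith("yes"):
            entries.append({"question": question, "passages": [passage], "answer": answer})
    return entries
```

Splitting generation and verification into separate LLM calls keeps low-quality pairs out of the dataset at the cost of extra calls; the verifier could also be a different (cheaper or stricter) model.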
Motivation
In most cases we'll have access to corpus data but will lack QA pairs to evaluate the system against. Manual annotation is tedious, so an automated pipeline for generating high-quality datasets would save significant development time.
Additional context
Some libraries, such as LlamaIndex, have a similar pipeline defined here, but I don't know the quality of that solution. We may also consider integrating with third-party tools such as Argilla.