feat(datasets): add synthetic data generation pipeline #127
Labels
document search
Changes to the document search package
evals
Adding new evaluation pipelines or improving existing ones
feature
New feature or request
Feature description
It'd be great to be able to generate synthetic datasets for each module we want to evaluate. For now, we're focusing only on the document search module, but in the future we also want pipelines for other modules, such as generation (LLM answers based on context) or structured data retrieval (text-to-SQL data), so the interface should make it easy to introduce new pipelines.
For document retrieval, we should scan the corpus and generate a set of questions, pairing each question with carefully selected passages that contain the answer and, optionally, additional context that allows the model to produce a factual answer. Ideally, both parts should be generated by an LLM and then verified.
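A self-contained sketch of that generate-then-verify loop, assuming `llm` is any `str -> str` callable (the function name, prompts, and dict schema are all assumptions for illustration, not a committed design):

```python
import random


def generate_qa_entries(corpus, llm, num_entries, seed=0):
    """Sample passages, have the LLM draft a question/answer pair for each,
    then ask it to verify the pair against the passage before keeping it."""
    rng = random.Random(seed)
    passages = rng.sample(corpus, min(num_entries, len(corpus)))
    entries = []
    for passage in passages:
        question = llm(f"Write one question answerable from this passage:\n{passage}")
        answer = llm(f"Passage:\n{passage}\nQuestion: {question}\nAnswer factually:")
        # Verification step: keep the pair only if the LLM confirms the
        # passage actually supports the generated answer.
        verdict = llm(
            f"Passage:\n{passage}\nQ: {question}\nA: {answer}\n"
            "Does the passage support the answer? yes/no:"
        )
        if verdict.strip().lower().startswith("yes"):
            entries.append({"question": question, "passages": [passage], "answer": answer})
    return entries
```

Splitting generation and verification into separate LLM calls keeps low-quality pairs out of the dataset at the cost of extra calls; the verifier could also be a different (cheaper or stricter) model.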
Motivation
In most cases we'll have access to corpus data but will lack QA pairs to evaluate the system against. Manual annotation is tedious, so an automated pipeline for generating high-quality datasets would save significant development time.
Additional context
Some libraries, such as LlamaIndex, have a similar pipeline defined here, but I don't know the quality of that solution. We may also consider integrating with third-party tools such as Argilla.