feat(datasets): add synthetic data generation pipeline #127

Closed
micpst opened this issue Oct 22, 2024 · 0 comments · Fixed by #165
Labels: document search, evals, feature

micpst commented Oct 22, 2024

Feature description

It'd be great to be able to generate synthetic datasets for each module we want to evaluate. For now, we're focusing only on the document search module, but in the future we'll also want pipelines for other modules, such as generation (LLM answers based on context) or structured data retrieval (text-to-SQL data), so the interface should make it easy to introduce new pipelines.
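
To make the extensibility requirement concrete, here is a minimal sketch of what such an interface could look like. All names (`DatasetGenerationPipeline`, `PIPELINES`, `register`, `build_pipeline`) are hypothetical and not part of ragbits; the document search implementation is a placeholder that does not call an LLM.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable


class DatasetGenerationPipeline(ABC):
    """Hypothetical base class: one subclass per module under evaluation."""

    @abstractmethod
    def generate(self, corpus: Iterable[str]) -> list[dict[str, Any]]:
        """Turn raw corpus texts into evaluation records."""


# Hypothetical registry mapping a module name to its pipeline class.
PIPELINES: dict[str, type[DatasetGenerationPipeline]] = {}


def register(name: str):
    """Class decorator that adds a pipeline to the registry under `name`."""

    def decorator(cls: type[DatasetGenerationPipeline]) -> type[DatasetGenerationPipeline]:
        PIPELINES[name] = cls
        return cls

    return decorator


@register("document-search")
class DocumentSearchPipeline(DatasetGenerationPipeline):
    """Placeholder only: a real version would ask an LLM for questions."""

    def generate(self, corpus: Iterable[str]) -> list[dict[str, Any]]:
        return [
            {"question": f"What does document {i} describe?", "passages": [text]}
            for i, text in enumerate(corpus)
        ]


def build_pipeline(name: str) -> DatasetGenerationPipeline:
    """Look up a registered pipeline by module name and instantiate it."""
    return PIPELINES[name]()
```

With this shape, adding a pipeline for generation or text-to-SQL would only require a new subclass and a `@register(...)` line; callers keep using `build_pipeline(name).generate(corpus)`.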

For document retrieval, we should scan the corpus and generate a set of questions, each paired with carefully selected passages that contain the answer and, optionally, some important context that will allow the model to generate a factual answer. Ideally, both the questions and the passages should be generated by an LLM and then verified.
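
A rough sketch of that flow, under several assumptions: the LLM is modelled as a plain callable (prompt in, text out), passages are fixed-size character chunks, and the neighbouring chunks stand in for "important context". The names `QAExample`, `chunk`, and `generate_examples`, as well as the prompts, are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: any LLM client can be wrapped as "prompt in, completion out".
LLM = Callable[[str], str]


@dataclass
class QAExample:
    question: str
    answer_passage: str
    context_passages: list[str]


def chunk(text: str, size: int = 200) -> list[str]:
    """Split a document into fixed-size character passages."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def generate_examples(
    corpus: list[str], generate: LLM, verify: LLM, chunk_size: int = 200
) -> list[QAExample]:
    """Generate one question per passage and keep only verified ones."""
    examples: list[QAExample] = []
    for document in corpus:
        passages = chunk(document, chunk_size)
        for idx, passage in enumerate(passages):
            question = generate(
                f"Write one question answered by this passage:\n{passage}"
            ).strip()
            if not question:
                continue
            # Second LLM call checks that the passage really answers it.
            verdict = verify(
                f"Passage:\n{passage}\nQuestion: {question}\n"
                "Does the passage answer the question? Reply yes or no."
            )
            if verdict.strip().lower().startswith("yes"):
                # Neighbouring passages serve as optional extra context.
                context = passages[max(0, idx - 1):idx] + passages[idx + 1:idx + 2]
                examples.append(QAExample(question, passage, context))
    return examples
```

Passing the generator and verifier as separate callables makes it possible to use a different model (or a stub in tests) for verification than for generation.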

Motivation

In most cases, we'll have access to corpus data but lack the QA pairs needed to evaluate the system. Manual annotation is tedious, so an automated pipeline that generates high-quality datasets would save development time.

Additional context

Some libs, such as LlamaIndex, have a similar pipeline defined here, but I don't know the quality of this solution. We may also consider integrating with third-party tools such as Argilla.

micpst added the feature, document search, and evals labels Oct 22, 2024
micpst added this to ragbits Oct 22, 2024
micpst moved this to Backlog in ragbits Oct 22, 2024
micpst moved this from Backlog to Ready in ragbits Oct 22, 2024
mhordynski added this to the Ragbits 0.4 milestone Oct 23, 2024
akotyla self-assigned this Oct 25, 2024
micpst moved this from Ready to In Progress in ragbits Oct 28, 2024
akotyla linked a pull request Oct 31, 2024 that will close this issue
kdziedzic68 assigned kdziedzic68 and unassigned akotyla Nov 7, 2024
github-project-automation bot moved this from In review to Done in ragbits Nov 20, 2024
Projects
Status: Done
4 participants