How to build a custom data set for Question Answering #2207

gabriead · 2022-02-17T09:57:56Z

Hi community,
I have two questions about building a custom data set for the "Question Answering" task.
I have very detailed user questions and their respective answers, but no context that the answers refer to. How can I make use of these question/answer pairs in a training data set? Can they still be used to train a model? Can I "synthetically" embed the answers in a context? Or is there a different model for which it is sufficient to input "only" questions and answers, without context, and still get good results?
Is there a rule of thumb for how many questions/answers are needed for reasonable training?

Answered by bogdankostic

bogdankostic · 2022-02-21T10:36:19Z

Hi @gabriead! To train an extractive QA model, you need a context that contains the answer, together with the exact position of the answer inside that context. You would therefore need to map your question-answer pairs to documents containing the answers and extract the position of each answer. However, you might use your data for open-domain evaluation, as this does not require the exact position of an answer. This way, you can check whether existing models are already good enough for your use case, so that you don't need to train a custom model. See this blog post for more information on evaluation.
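
As a rough illustration of the mapping step described above, here is a minimal sketch that looks up an answer string in a document and stores its character offset in a SQuAD-style record. The helper name `build_squad_record`, the example text, and the `"custom"` version label are all made up for illustration, and the sketch assumes the answer appears verbatim in the context; paraphrased answers would need manual annotation or fuzzy matching instead.

```python
import json


def build_squad_record(question, answer, context, qid):
    """Return a SQuAD-style paragraph dict, or None if the answer
    does not occur verbatim in the context (hypothetical helper)."""
    start = context.find(answer)  # character offset of the first occurrence
    if start == -1:
        return None
    return {
        "context": context,
        "qas": [
            {
                "id": qid,
                "question": question,
                "answers": [{"text": answer, "answer_start": start}],
            }
        ],
    }


# Made-up example data, only to show the shape of the result.
context = "The warranty covers manufacturing defects for 24 months from the date of purchase."
record = build_squad_record("How long is the warranty?", "24 months", context, "q1")

# Wrap the paragraph in the usual SQuAD top-level structure.
dataset = {
    "version": "custom",
    "data": [{"title": "example", "paragraphs": [record]}],
}
print(json.dumps(dataset, indent=2))
```

Pairs for which the helper returns `None` have no exact match in the chosen document, so they cannot be used for extractive training as they are.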

As to how many labels are needed for reasonable training: this depends highly on your domain and on how much your use case diverges from SQuAD. We have seen that models trained on SQuAD show very strong general question answering capabilities. Therefore, we’d recommend trying one of the off-the-shelf models before trying to adapt these models to your domain.
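
To check an off-the-shelf model against question/answer pairs that have no answer positions, one option is to compare the predicted answer strings with your gold answers. The sketch below uses SQuAD-style exact match and token-level F1; this choice of metrics is an assumption, not something prescribed in this thread, and the prediction strings are placeholders that would come from whichever model you try.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """True if both strings are identical after normalization."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match, one empty as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Placeholder (prediction, gold answer) pairs for illustration only.
pairs = [("The 24 months.", "24 months"), ("two years", "24 months")]
for prediction, gold in pairs:
    print(prediction, "|", gold, "|", exact_match(prediction, gold), token_f1(prediction, gold))
```

String overlap is a strict measure: a correct but differently worded answer such as "two years" scores zero here, so low scores are worth inspecting by hand before concluding that a model is unsuitable.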
