-
Hello, I am quite desperate: I have been trying to make this work and I can't. I am doing my master's thesis, I would like to train a DPR model on a SQuAD-style dataset, and I have to submit next week.

Some context: my goal is to train a DPR model for Catalan using the SQuAD dataset translated into that language (https://huggingface.co/datasets/BSC-TeMU/viquiquad). I saw that you have a script that mines the hard negative passages with BM25, as described in the paper. I tried to use this script to convert the SQuAD dataset into DPR format, but I can initialize neither the Elasticsearch server nor the FAISS document store. I tried installing FAISS to back the document store, without success; my computer does not support CUDA because my GPU is quite old. I also tried running the code in Colab, but I cannot access the server there: I manage to create it, but I cannot write any data to it. I tried the code on a friend's computer as well, and the Elasticsearch error came up again.

My last resort was to change the code by hand so that it works with InMemoryDocumentStore, but I got really confused by the objects used in the code. As my dataset only has 11k questions, I don't think Elasticsearch or FAISS is really necessary to retrieve the most relevant documents, although maybe I am wrong and I have no real idea of what I am doing.

Is it possible to change the script so that it works with InMemoryDocumentStore? I think I have already tried everything and this is driving me crazy. I also thought about writing my own code to retrieve the hard negative contexts, but I don't trust myself to do it well, and it would take too long given the time I have left. If needed I can share screenshots; I just deleted all my Anaconda environments because none of them worked.

I only wanted to train this model to make the project more solid. I would be really grateful if someone could give me a hint about what is going on with the code. If not, I will simply drop the idea of training the model. Thanks for all the effort that went into this library. It has helped me a lot with my master's thesis.
-
I am back here again with good news 😆

I reviewed the code and saw that I had to create an ElasticsearchDocumentStore, because by default the BM25 retriever works with ElasticsearchDocumentStore, OpenSearchDocumentStore or OpenDistroElasticsearchDocumentStore.

I didn't have any hope that this would work, but I executed the code from this tutorial to create the Elasticsearch server: https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial4_FAQ_style_QA.ipynb#scrollTo=cJ36e9nia7-l

With zero hope I ran the code, and magically everything started to work smoothly. I know SQuAD is not the most suitable dataset for training DPR, but there is no other dataset for Catalan, so I chose this one. The hard negatives look good, and I will now run the training process with a pretrained Catalan model.

Thanks again for developing this amazing library. Greetings 😄
-
Hi @Myko-10, I created a starter notebook based on tutorial 09 (DPR training): DPR-Catalan.ipynb. It downloads the data from Hugging Face, starts the ES server, and then runs the squad_to_dpr script to prepare the data for training. But yeah, good that you managed to get it working as well :D

Note: I think you should keep an eye on the details of tokenization. It might be English-based, though I'm not sure whether that is a problem for Catalan.
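For readers who cannot get Elasticsearch running at all, the BM25 hard-negative step discussed in this thread can be approximated in plain Python for a dataset of this size. The following is only a minimal sketch under stated assumptions, not the Haystack squad_to_dpr script: the names `tokenize`, `BM25` and `mine_hard_negatives` are hypothetical, the BM25 parameters are common defaults rather than values taken from the script, and the output field names follow the DPR training JSON layout as I understand it, so compare them against the script's real output before training. The tokenizer is Unicode-aware and keeps the Catalan middle dot inside words, which touches on the tokenization note above.

```python
import json
import math
import re
from collections import Counter


def tokenize(text):
    # \w is Unicode-aware in Python 3, so à, é, ç, ï are kept.
    # The middle dot is not a word character, so it is handled explicitly
    # to keep words such as "col·legi" in one piece.
    return re.findall(r"\w+(?:·\w+)*", text.lower())


class BM25:
    """Brute-force BM25 over a small, non-empty list of passages (hypothetical helper)."""

    def __init__(self, passages, k1=1.5, b=0.75):  # assumed defaults
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(p)) for p in passages]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = sum(self.lengths) / len(self.lengths)
        df = Counter(term for tf in self.tfs for term in tf)
        n = len(passages)
        self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}

    def score(self, query_tokens, i):
        tf = self.tfs[i]
        norm = 1 - self.b + self.b * self.lengths[i] / self.avg_len
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + self.k1 * norm)
            for t in query_tokens
            if t in tf
        )

    def top_k(self, query, k):
        q = tokenize(query)
        scored = [(self.score(q, i), i) for i in range(len(self.tfs))]
        # Passages with no lexical overlap are not "hard" negatives, so drop them.
        scored = [(s, i) for s, i in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [i for _, i in scored[:k]]


def mine_hard_negatives(examples, passages, num_hard_negatives=2, pool=20):
    """examples: dicts with "question", "answers" and "positive_idx" (index into passages)."""
    bm25 = BM25(passages)
    records = []
    for ex in examples:
        answers = [a.lower() for a in ex["answers"]]
        hard = []
        for i in bm25.top_k(ex["question"], pool):
            text = passages[i]
            # Skip the gold passage and anything that contains an answer string.
            if i == ex["positive_idx"] or any(a in text.lower() for a in answers):
                continue
            hard.append({"title": "", "text": text, "passage_id": str(i)})
            if len(hard) == num_hard_negatives:
                break
        gold = ex["positive_idx"]
        records.append({
            "dataset": "viquiquad",  # free-form label, assumed
            "question": ex["question"],
            "answers": ex["answers"],
            "positive_ctxs": [{"title": "", "text": passages[gold], "passage_id": str(gold)}],
            "negative_ctxs": [],
            "hard_negative_ctxs": hard,
        })
    return records


# Toy data for illustration only; real passages would come from the dataset.
passages = [
    "Barcelona és la capital de Catalunya.",
    "Girona és una ciutat de Catalunya.",
    "La ciutat de Barcelona té platja.",
    "El gat dorm al sofà.",
]
examples = [{
    "question": "Quina és la capital de Catalunya?",
    "answers": ["Barcelona"],
    "positive_idx": 0,
}]
records = mine_hard_negatives(examples, passages)
print(json.dumps(records, ensure_ascii=False, indent=2))
```

Every query is scored against every passage, which should be tolerable for a few thousand passages and 11k questions but will not scale much further; an inverted index or a real search backend is the better choice beyond that.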