Query Refinement Backtranslation #27
@hosseinfani
If the file type is txt, the tag will be activated, and I modified the code accordingly. Subsequently, I incorporated the backtranslation expander, for now exclusively for French. You can find the relevant code in the "../qe/expander/backtranslation.py" file; the settings of the backtranslation model and languages are in the "../qe/cmn/param.py" file. To facilitate result comparison, I developed the "toy-compare.py" script, which currently lives in the toy directory, though I plan to relocate it to "../qe/eval". Three functions are available for comparing the results. A sketch of the expander itself is below.
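For context, here is a minimal sketch of what such a backtranslation expander can look like, assuming Hugging Face MarianMT models as the translation backend (the model names and the class interface here are assumptions; the project's actual choices live in "../qe/expander/backtranslation.py" and "../qe/cmn/param.py"):

```python
# Sketch only: the MarianMT model names are assumptions, not the project's settings.
from transformers import MarianMTModel, MarianTokenizer

class BackTranslationSketch:
    def __init__(self, tgt='fr'):
        # English -> pivot and pivot -> English translation models.
        self.fwd_tok = MarianTokenizer.from_pretrained(f'Helsinki-NLP/opus-mt-en-{tgt}')
        self.fwd = MarianMTModel.from_pretrained(f'Helsinki-NLP/opus-mt-en-{tgt}')
        self.bwd_tok = MarianTokenizer.from_pretrained(f'Helsinki-NLP/opus-mt-{tgt}-en')
        self.bwd = MarianMTModel.from_pretrained(f'Helsinki-NLP/opus-mt-{tgt}-en')

    def expand(self, query: str) -> str:
        # Translate the query into the pivot language, then back to English.
        mid_ids = self.fwd.generate(**self.fwd_tok(query, return_tensors='pt'))
        mid = self.fwd_tok.decode(mid_ids[0], skip_special_tokens=True)
        back_ids = self.bwd.generate(**self.bwd_tok(mid, return_tensors='pt'))
        return self.bwd_tok.decode(back_ids[0], skip_special_tokens=True)

print(BackTranslationSketch('fr').expand('international organized crime'))
```

The backtranslated string is then used as the refined query and can be scored against the original.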
Next, I updated the code to handle multiple languages and generate queries accordingly, and I compared the results with the "toy-compare.py" script; a rough sketch of the multi-language setup is shown below. However, there are still a few remaining bugs in the project.
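As a rough illustration, the language list can sit in a single settings dict and be looped over; the keys below are hypothetical and may differ from what "../qe/cmn/param.py" actually uses:

```python
# Hypothetical settings mirroring ../qe/cmn/param.py; the real keys may differ.
backtranslation = {'tgt_langs': ['fr', 'de', 'es', 'ru', 'zh']}

# One backtranslated variant per query per pivot language, reusing the
# BackTranslationSketch class from the sketch above.
queries = ['international organized crime', 'poliomyelitis and post polio']
for lang in backtranslation['tgt_langs']:
    expander = BackTranslationSketch(lang)
    for q in queries:
        print(lang, '=>', expander.expand(q))
```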
I have outlined tasks to be completed by next Friday.
@DelaramRajaei
@DelaramRajaei
@hosseinfani Yes, in the next step I am completing the plot and will report my findings about it.
I don't think so; it accumulates automatically.
I have updated the code and pushed the new changes. I fixed the bug in the antique dataset, changing main.py and abstractqexpander.py; there were some problems in reading and writing the new queries in .txt files (a sketch of that I/O is below). Here are two logs of running the code with the backtranslation expander on 5 different languages for the robust04 and antique datasets.
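For reference, the .txt handling that caused trouble boils down to something like the sketch below; the one-query-per-line layout and the function name are assumptions, and the real logic in abstractqexpander.py is more involved:

```python
# Sketch: read original queries from a .txt file (assumed one query per line)
# and write one expanded query per line to the output file.
def expand_query_file(infile, outfile, expander):
    with open(infile, encoding='utf-8') as f:
        queries = [line.strip() for line in f if line.strip()]
    with open(outfile, 'w', encoding='utf-8') as f:
        for q in queries:
            f.write(expander.expand(q) + '\n')
```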
@hosseinfani |
@DelaramRajaei |
@DelaramRajaei |
I have run the program for the datasets below; here are the logs of running the code with the backtranslation expander on 5 different languages for these datasets.
Unfortunately, I was unable to download the indexes for the ClueWeb datasets due to their large size. I am currently drafting the paper and analyzing the results to identify trends.
@DelaramRajaei |
I have run the program for clueweb09b and this is the log. Unfortunately, I encountered a problem with the zip files for clueweb12b13: they were corrupted, and I am currently exploring potential fixes. In addition, I have been plotting the results, comparing the mean average precision (mAP) of the original queries with that of the backtranslated queries; a sketch of the plotting code is below.
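The comparison plot is essentially a grouped bar chart of mAP per dataset. A minimal matplotlib sketch, with placeholder numbers rather than the measured results:

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder mAP values for illustration only; not the actual results.
datasets = ['robust04', 'antique', 'dbpedia', 'clueweb09b']
map_original = [0.25, 0.18, 0.30, 0.15]
map_backtranslated = [0.26, 0.17, 0.31, 0.14]

x = np.arange(len(datasets))
width = 0.35
plt.bar(x - width / 2, map_original, width, label='original queries')
plt.bar(x + width / 2, map_backtranslated, width, label='backtranslated queries')
plt.xticks(x, datasets)
plt.ylabel('mAP')
plt.legend()
plt.show()
```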
After analyzing the results of these five datasets in five distinct languages, here are the findings: overall, the "dbpedia" and "robust04" datasets tend to yield better results than the others. Additionally, Isaac compiled a list of new datasets related to law, medicine, and finance, which I can process and analyze. I can also change the translation model and see whether that improves the results.
Hey @hosseinfani, I'd like to fill you in on my activities this week. I've been working on adding tct-colbert as a dense retrieval method. I went through the RePair project and Pyserini's documentation on dense retrieval, and it seems I need to modify the format of my stored files.

To maintain the integrity of the original code, I introduced new functions. There was a problem with the previous approach, so I decided to restructure it a bit. Here is an overview of how things unfold: we begin by providing the filename of the original queries, in any format. Once the file is read, it produces a dataframe as output. Afterward, the remaining steps are driven by specifying the file name.

This architecture now supports using only Pyserini, which let me remove Anserini from the code in the search and evaluate functions. To modify the evaluation function, I referred to the documentation provided by Pyserini. After encountering several bugs and errors, I'm pleased to share that I've addressed all of them today, and the project now runs seamlessly.

Subsequently, I attempted to add tct_colbert to the project, and I succeeded. However, I'm currently facing an indexing issue with the datasets: I need to encode them and obtain the dense index (a sketch of the dense retrieval step is below).

In the meantime, I've updated both the environment.yaml and requirement.txt files, changing some library versions and introducing new ones. Isaac reviewed these changes and confirmed they work smoothly. I've also updated the Excel task sheet and outlined my upcoming tasks.
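For the dense-retrieval step, a minimal Pyserini sketch with tct_colbert, based on my reading of Pyserini's documentation (the prebuilt MS MARCO index name is an example from their docs; for our datasets it would be replaced by the dense index we still need to encode):

```python
from pyserini.search.faiss import FaissSearcher, TctColBertQueryEncoder

# Encode the query with TCT-ColBERT and search a FAISS dense index.
encoder = TctColBertQueryEncoder('castorini/tct_colbert-v2-hnp-msmarco')
searcher = FaissSearcher.from_prebuilt_index('msmarco-passage-tct_colbert-v2-hnp-bf', encoder)

hits = searcher.search('international organized crime', k=10)
for i, hit in enumerate(hits):
    print(f'{i + 1:2} {hit.docid:15} {hit.score:.5f}')
```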
@DelaramRajaei |
This is the issue where I report my progress on the project.