Sum url content using Sentence Transformers before making prediction #114
Conversation
…ormation from each website in order to estimate the prediction market question later. This strategy replaces the previous one, in which only the first 300 characters of each website's HTML text were extracted and used for prediction.
… extract only the relevant sections.
…T-4 is able to give p_no = 1 when the specified event deadline has been exceeded and no information supporting p_yes is found.
…f the event happening by a specific deadline only if it assumes that it has access to up-to-date information.
…date information queried from a search engine; must pay attention to the keyword BEFORE in 'event question'.
…t instructions to also pay attention to the keyword ON in 'event question'.
…nt dates and surrounding context
…pletion; added minor changes accounting for not exceeding string limits
… for better efficiency
…ulating similarities
Looks great, thanks!
CI failures are unrelated to the changes here.
…1 for better performance
…es and return values
…idered by gpt-4 yet.
…an event happening BY a certain date. Considers whether the event has already happened.
This tool performs the following tasks:
- Generates search engine queries with the OpenAI API based on a user prompt.
- Gets the top URLs for each query via the Google Custom Search API.
- Loads and cleans each HTML page and splits its text into sentences using spaCy.
- Uses the sentence transformer 'multi-qa-distilbert-cos-v1' to create sentence embeddings for each sentence and for the event question.
- Calculates the dot product between the event question embedding and each sentence embedding to get a similarity score.
- Concatenates the top-scoring sentences from each website.
- Passes the extracted information as additional information into the prediction prompt.
- Makes an OpenAI API call with the prediction prompt, which requests a probability estimate that the event in the user prompt occurs, given the additional information.
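The embedding-and-ranking steps above can be sketched as follows. Note this is a toy illustration: the `embed` function below is a bag-of-words stand-in for the real model (the actual tool would call `SentenceTransformer.encode` with 'multi-qa-distilbert-cos-v1'), and the question and sentences are made up for the example.

```python
import re

import numpy as np


def tokens(text):
    """Lowercase word tokens; a crude stand-in for spaCy sentence/word handling."""
    return re.findall(r"[a-z']+", text.lower())


def embed(text, vocab):
    """Toy bag-of-words embedding, L2-normalised so that the dot product
    equals cosine similarity. The real tool uses a transformer model."""
    vec = np.zeros(len(vocab))
    for word in tokens(text):
        if word in vocab:
            vec[vocab[word]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec


def top_sentences(question, sentences, k=2):
    """Rank sentences by dot product with the question embedding
    and keep the k most similar ones."""
    words = sorted(set(tokens(question)) | {w for s in sentences for w in tokens(s)})
    vocab = {w: i for i, w in enumerate(words)}
    q_vec = embed(question, vocab)
    ranked = sorted(
        sentences,
        key=lambda s: float(np.dot(q_vec, embed(s, vocab))),
        reverse=True,
    )
    return ranked[:k]


question = "Will the rocket launch before July?"
sentences = [
    "The launch is scheduled for late June.",
    "The rocket launch was delayed again.",
    "Local weather was sunny all week.",
]
relevant = top_sentences(question, sentences)
```

With a real embedding model the same ranking logic applies; only `embed` changes, and the concatenated top sentences become the additional information in the prediction prompt.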
This reverts commit 6df7711.
I'm struggling to get it running. Here is what I'm trying to achieve: to gather the most relevant information on the internet for predicting the prediction market question, I want to use a sentence embedding model. These models map each sentence into a vector space. By calculating the dot product between each sentence embedding and the question embedding, it is possible to obtain a measure of semantic similarity between the question and each sentence. The sentences most similar to the prediction market question tend to be the most relevant for making a prediction estimate. I tested different models locally and they perform very well while being cost efficient. Only the preselected relevant content will be passed to GPT-4, which makes the prediction. The frameworks best suited for this are Sentence Transformers (BERT) and Universal Sentence Encoder (TensorFlow). With either model the code does not pass the PR workflow.
Sentence Transformers (BERT)
Universal Sentence Encoder (TensorFlow)
Is it possible to add support for the 'url' substitution type to enable the use of PyTorch without CUDA in the code? Or, alternatively, is it possible to extend the authorized licenses with NVIDIA's? I think there is huge potential in preprocessing to select the most relevant information for a reliable prediction. I would appreciate your help here. Maybe you have suggestions, too?
@0xArdi The licence related to NVIDIA can be whitelisted, as it is released under Apache 2.0: https://github.com/NVIDIA/NVTX/blob/release-v3/LICENSE.txt
@0xArdi
It would be possible to define a PyTorch version in the pyproject.toml that does not automatically install the NVIDIA CUDA toolkit. For this, I have to define the source URL of that version in the pyproject.toml. But when I do this, the workflow fails with the error:
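For context, a pyproject.toml source entry of the kind described might look like the following. This is a hedged sketch, not the PR's actual configuration: the source name, version constraint, and `priority` field are illustrative, and the exact syntax depends on the Poetry version in use (older versions mark sources as `secondary` instead of using `priority`).

```toml
# Illustrative only: pin a CPU-only torch wheel via an explicit package source,
# so the NVIDIA CUDA toolkit packages are never pulled in.
[[tool.poetry.source]]
name = "pytorch-cpu"                          # assumed source name
url = "https://download.pytorch.org/whl/cpu"  # PyTorch's CPU-only wheel index
priority = "explicit"

[tool.poetry.dependencies]
torch = { version = "^2.1", source = "pytorch-cpu" }  # assumed version constraint
```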
If support for the 'url' substitution type were added, there would be no need to whitelist the NVIDIA licenses, I think, because then the corresponding packages would not be installed with PyTorch.
Hello, this is Armagan from Gnosis. Congratulations on winning a prize for the Best Mech Tool in the Prediction Agent Hackathon! Your innovative contribution stood out, and we're excited to award your work. Please reply to this comment with your Gnosis Chain wallet address so we can proceed with transferring your prize in xDAI. Thank you for your fantastic work, and we look forward to your continued contributions!
Hi @0xarmagan, thank you for the great news! I'm honored and looking forward to future collaborations and contributions. My wallet address is 0x4a143CdC6B9A8AFb4D54E01A95A66c29ce8591DF.
This PR implements a new tool that uses the OpenAI API to extract the most relevant information from the HTML content of a browsed URL. This can be seen as preprocessing of each URL's content, which is stored in the additional information variable. This additional information leads to more accurate predictions by the tool.
Work in progress: