Sum url content using Sentence Transformers before making prediction #114

Merged
merged 36 commits into valory-xyz:main on Oct 13, 2023

Conversation

jhehemann
Contributor

This PR implements a new tool that uses the OpenAI API to extract the most relevant information from the HTML content of a browsed URL. This can be seen as preprocessing of each URL's content, which is stored in the additional information variable. This additional information leads to more accurate predictions by the tool.

Work in progress:

  • Currently, the get_website_summary() function feeds the whole HTML content to the gpt-3.5-turbo model. As the model can only process 4K tokens, it throws an exception and skips the URL if the HTML content exceeds the limit (see the sketch below).
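
A minimal sketch of how the content could be pre-truncated to stay within that budget, assuming tiktoken is available and the pre-1.0 openai client that was current at the time; the helper name, token budget, and prompt are illustrative and not the actual get_website_summary() implementation:

```python
import openai
import tiktoken

def summarize_page_text(text: str, max_input_tokens: int = 3000) -> str:
    """Truncate page text to a token budget before asking gpt-3.5-turbo to summarize it.

    Illustrative only: token budget, prompt, and helper name are assumptions.
    """
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens = enc.encode(text)
    if len(tokens) > max_input_tokens:
        # Keep only as many tokens as fit the model's context window.
        text = enc.decode(tokens[:max_input_tokens])

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Extract the information most relevant to the question."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]
```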

…ormation from each website in order to estimate the prediction market question later. This strategy replaced the strategy where only the first 300 characters from each website's html text were extracted and used for prediction.
@jhehemann jhehemann marked this pull request as ready for review September 21, 2023 15:40
…T-4 is able to give p_no = 1 when specified event deadline exceeded and no information indicating for p_yes are found.
…f the event happening by a specific deadline only if it assumes that it has access to up-to-date information.
…date information queried from a search engine; Must pay attention on keyword BEFORE in 'event question'.
…t instructions to also pay attention on the keyword ON in 'event question'.
…pletion; added minor changes accounting for not exceeding string limits
Collaborator

@0xArdi 0xArdi left a comment

Looks great, thanks!

@0xArdi
Collaborator

0xArdi commented Sep 25, 2023

CI failures are unrelated to the changes here.

Contributor Author

@jhehemann jhehemann left a comment

This tool does the following tasks:

  • Generates search engine queries with the OpenAI API based on a user prompt.
  • Gets the top URLs for each query via the Google Custom Search API.
  • Loads and cleans each HTML page and splits the text into sentences using spaCy.
  • Uses the sentence transformer 'multi-qa-distilbert-cos-v1' to create sentence embeddings for each sentence and for the event question.
  • Calculates the dot product between the event question and each sentence to get a similarity score (see the sketch below).
  • Concatenates the top sentences from each website.
  • Passes the extracted information as additional information into the prediction prompt.
  • Makes an OpenAI API call with the prediction prompt, which requests a probability estimate that the event in the user prompt will occur, given the additional information.
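
For illustration, a minimal sketch of the selection step, assuming the sentence-transformers and spaCy packages; the helper name, spaCy pipeline, and number of selected sentences are arbitrary choices, not the tool's exact implementation:

```python
import spacy
from sentence_transformers import SentenceTransformer, util

def select_relevant_sentences(event_question: str, page_text: str, top_k: int = 5) -> str:
    """Return the top_k sentences most similar to the event question."""
    # Split the cleaned page text into sentences (pipeline name is illustrative).
    nlp = spacy.load("en_core_web_sm")
    sentences = [sent.text.strip() for sent in nlp(page_text).sents if sent.text.strip()]

    # Embed the question and all sentences with the model named above.
    model = SentenceTransformer("multi-qa-distilbert-cos-v1")
    question_emb = model.encode(event_question, convert_to_tensor=True)
    sentence_embs = model.encode(sentences, convert_to_tensor=True)

    # Dot-product similarity between the question and every sentence.
    scores = util.dot_score(question_emb, sentence_embs)[0]

    # Keep the highest-scoring sentences as additional information.
    ranked = sorted(zip(scores.tolist(), sentences), reverse=True)
    return " ".join(sentence for _, sentence in ranked[:top_k])
```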

@jhehemann jhehemann changed the title Sum url content with ChatGPT before making prediction Sum url content with Sentence Transformers before making prediction Sep 29, 2023
@jhehemann jhehemann changed the title Sum url content with Sentence Transformers before making prediction Sum url content using Sentence Transformers before making prediction Sep 29, 2023
@jhehemann
Contributor Author

jhehemann commented Oct 1, 2023

I'm struggling to get it running. Here is what I'm trying to achieve:

To gather the most relevant information on the internet for predicting the prediction market question, I want to use a sentence embedding model. These models map each sentence into a vector space. By calculating the dot product between each sentence embedding and the question embedding, it is possible to get a measure of semantic similarity between the question and each sentence. The sentences most similar to the prediction market question tend to be the most relevant for making a prediction estimate. I tested different models locally and they perform very well while being cost-efficient. Only the preselected relevant content is passed to GPT-4, which makes the prediction. The frameworks best suited for this are Sentence Transformers (BERT) and Universal Sentence Encoder (TensorFlow). With either model the code does not pass the PR workflow.

Sentence Transformers (BERT)
Sentence Transformers has PyTorch as a dependency. PyTorch is a deep learning tensor library and comes with the CUDA API, which enables the use of NVIDIA GPUs for computation. The CUDA libraries are licensed by NVIDIA, which is why the workflow fails (https://github.com/NVIDIA/nccl/blob/master/LICENSE.txt). However, it is possible to install PyTorch without CUDA, but only by providing the source URL of the specific version in the pyproject.toml file. When I tried this I got an error in the "license compatibility check" job:

"File "/opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/tox/config/__init__.py", line 1965, in _replace_match
     raise tox.exception.ConfigError(
 tox.exception.ConfigError: ConfigError: No support for the 'url' substitution type"
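
For illustration, the kind of pyproject.toml entry I mean looks roughly like this, assuming Poetry-style dependency declarations; the wheel URL is only a placeholder for the CPU-only build of the pinned version:

```toml
[tool.poetry.dependencies]
# Placeholder URL: the exact CPU-only wheel for the pinned PyTorch version would go here.
torch = { url = "https://download.pytorch.org/whl/cpu/torch-2.0.1%2Bcpu-cp310-cp310-linux_x86_64.whl" }
```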

Universal Sentence Encoder (TensorFlow)
This is a model library from TensorFlow Hub. The problem here is that the installed TensorFlow version must match >2.14.0,<3.0.0, and TensorFlow requires protobuf (>=3.20.3,<4.21.0 || >4.21.0,<4.21.1 || >4.21.1,<4.21.2 || >4.21.2,<4.21.3 || >4.21.3,<4.21.4 || >4.21.4,<4.21.5 || >4.21.5,<5.0.0dev). But as the mech depends on protobuf (<=3.20.1,>=3.19), version solving failed.

Is it possible to add support for the 'url' substitution type to enable the use of PyTorch without CUDA? Or, alternatively, is it possible to extend the authorized licenses with NVIDIA's?
Or is there a chance to update protobuf to a version >=3.20.3 to make use of TensorFlow?

I think there is huge potential in preprocessing to select the most relevant information for making a reliable prediction. I would appreciate it if you could help me here. Maybe you have suggestions, too?

@dagacha

dagacha commented Oct 5, 2023

@0xArdi The license related to NVIDIA can be whitelisted, as it is released under Apache 2.0: https://github.com/NVIDIA/NVTX/blob/release-v3/LICENSE.txt

@jhehemann
Contributor Author

jhehemann commented Oct 6, 2023

@0xArdi
There are 12 components licensed by NVIDIA that do not pass the workflow:

  1. nvidia-cublas-cu12 (12.1.3.1)
  2. nvidia-cuda-cupti-cu12 (12.1.105)
  3. nvidia-cuda-nvrtc-cu12 (12.1.105)
  4. nvidia-cuda-runtime-cu12 (12.1.105)
  5. nvidia-cudnn-cu12 (8.9.2.26)
  6. nvidia-cufft-cu12 (11.0.2.54)
  7. nvidia-curand-cu12 (10.3.2.106)
  8. nvidia-cusolver-cu12 (11.4.5.107)
  9. nvidia-cusparse-cu12 (12.1.0.106)
  10. nvidia-nccl-cu12 (2.18.1)
  11. nvidia-nvjitlink-cu12 (12.2.140)
  12. nvidia-nvtx-cu12 (12.1.105)

It would be possible to define a PyTorch version in the pyproject.toml that does not automatically install the NVIDIA CUDA toolkit. For this I have to define the source URL of the version in the pyproject.toml (as in the sketch above). But when I do this, the workflow fails with the error:

 File "/opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/tox/config/__init__.py", line 1965, in _replace_match
    raise tox.exception.ConfigError(
tox.exception.ConfigError: ConfigError: No support for the 'url' substitution type

If support for the 'url' substitution type were added, I think there would be no need to whitelist the NVIDIA licenses, because the corresponding packages would then not be installed with PyTorch.

@0xArdi 0xArdi merged commit a4175d7 into valory-xyz:main Oct 13, 2023
4 of 6 checks passed
@0xArdi 0xArdi mentioned this pull request Oct 13, 2023
@0xarmagan

Hello, this is Armagan from Gnosis.

Congratulations on winning a prize for the Best Mech Tool in the Prediction Agent Hackathon! Your innovative contribution stood out, and we're excited to award your work.

Please reply to this comment with your Gnosis Chain wallet address so we can proceed with transferring your prize in xDAI.

Thank you for your fantastic work, and we look forward to your continued contributions!

@jhehemann
Contributor Author

Hi @0xarmagan, thank you for the great news! I'm honored and looking forward to future collaborations and contributions. My wallet address is 0x4a143CdC6B9A8AFb4D54E01A95A66c29ce8591DF.
