Sum url content using Sentence Transformers before making prediction #114

Merged
merged 36 commits into valory-xyz:main on Oct 13, 2023

Conversation

jhehemann
Contributor

This PR implements a new tool that uses the OpenAI API to extract the most relevant information from the HTML content of a browsed URL. This can be seen as preprocessing of each URL's content, which is stored in the additional information variable. This additional information leads to more accurate predictions by the tool.

Work in progress:

  • Currently, the get_website_summary() function feeds the whole HTML content to the gpt-3.5-turbo model. As the model can only process 4K tokens, it throws an exception and skips the URL if the HTML content exceeds the limit (see the sketch below).
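
A minimal sketch of how the content could be pre-truncated to stay within that budget, assuming tiktoken is available and the pre-1.0 openai client that was current at the time; the helper name, token budget, and prompt are illustrative and not the actual get_website_summary() implementation:

```python
import openai
import tiktoken

def summarize_page_text(text: str, max_input_tokens: int = 3000) -> str:
    """Truncate page text to a token budget before asking gpt-3.5-turbo to summarize it.

    Illustrative only: token budget, prompt, and helper name are assumptions.
    """
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens = enc.encode(text)
    if len(tokens) > max_input_tokens:
        # Keep only as many tokens as fit the model's context window.
        text = enc.decode(tokens[:max_input_tokens])

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Extract the information most relevant to the question."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]
```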

…ormation from each website in order to estimate the prediction market question later. This strategy replaced the strategy where only the first 300 characters from each website's html text were extracted and used for prediction.
@jhehemann jhehemann marked this pull request as ready for review September 21, 2023 15:40
…T-4 is able to give p_no = 1 when specified event deadline exceeded and no information indicating for p_yes are found.
…f the event happening by a specific deadline only if it assumes that it has access to up-to-date information.
…date information queried from a search engine; Must pay attention on keyword BEFORE in 'event question'.
…t instructions to also pay attention on the keyword ON in 'event question'.
…pletion; added minor changes accounting for not exceeding string limits
Collaborator

@0xArdi 0xArdi left a comment

Looks great, thanks!

@0xArdi
Collaborator

0xArdi commented Sep 25, 2023

CI failures are unrelated to the changes here.

Contributor Author

@jhehemann jhehemann left a comment

This tool does the following tasks:

  • Generates search engine queries with the OpenAI API based on a user prompt.
  • Gets the top URLs for each query via the Google Custom Search API.
  • Loads and cleans each HTML page and splits the text into sentences using spaCy.
  • Uses the sentence transformer 'multi-qa-distilbert-cos-v1' to create sentence embeddings for each sentence and for the event question.
  • Calculates the dot product between the event question and each sentence to get a similarity score (see the sketch below).
  • Concatenates the top sentences from each website.
  • Passes the extracted information as additional information into the prediction prompt.
  • Makes an OpenAI API call with the prediction prompt, which requests a probability estimate that the event in the user prompt will occur, given the additional information.
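
For illustration, a minimal sketch of the selection step, assuming the sentence-transformers and spaCy packages; the helper name, spaCy pipeline, and number of selected sentences are arbitrary choices, not the tool's exact implementation:

```python
import spacy
from sentence_transformers import SentenceTransformer, util

def select_relevant_sentences(event_question: str, page_text: str, top_k: int = 5) -> str:
    """Return the top_k sentences most similar to the event question."""
    # Split the cleaned page text into sentences (pipeline name is illustrative).
    nlp = spacy.load("en_core_web_sm")
    sentences = [sent.text.strip() for sent in nlp(page_text).sents if sent.text.strip()]

    # Embed the question and all sentences with the model named above.
    model = SentenceTransformer("multi-qa-distilbert-cos-v1")
    question_emb = model.encode(event_question, convert_to_tensor=True)
    sentence_embs = model.encode(sentences, convert_to_tensor=True)

    # Dot-product similarity between the question and every sentence.
    scores = util.dot_score(question_emb, sentence_embs)[0]

    # Keep the highest-scoring sentences as additional information.
    ranked = sorted(zip(scores.tolist(), sentences), reverse=True)
    return " ".join(sentence for _, sentence in ranked[:top_k])
```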

@jhehemann jhehemann changed the title Sum url content with ChatGPT before making prediction Sum url content with Sentence Transformers before making prediction Sep 29, 2023
@jhehemann jhehemann changed the title Sum url content with Sentence Transformers before making prediction Sum url content using Sentence Transformers before making prediction Sep 29, 2023
@jhehemann
Contributor Author

jhehemann commented Oct 1, 2023

I'm struggling to get it running. Here is what I'm trying to achieve:

To gather the most relevant information on the internet for predicting the prediction market question, I want to use a sentence embedding model. These models map each sentence into a vector space. By calculating the dot product between each sentence embedding and the question embedding, it is possible to get a measure of semantic similarity between the question and each sentence. The sentences most similar to the prediction market question tend to be the most relevant for making a prediction estimate. I tested different models locally and they perform very well while being cost-efficient. Only the preselected relevant content is passed to GPT-4, which makes the prediction. The frameworks best suited for this are Sentence Transformers (BERT) and Universal Sentence Encoder (TensorFlow). With either model the code does not pass the PR workflow.

Sentence Transformers (BERT)
Sentence Transformers has PyTorch as a dependency. PyTorch is a deep learning tensor library and comes with the CUDA API, which enables the use of NVIDIA GPUs for computation. The CUDA libraries are licensed by NVIDIA, which is why the workflow fails (https://github.com/NVIDIA/nccl/blob/master/LICENSE.txt). However, it is possible to install PyTorch without CUDA, but only by providing the source URL of the specific version in the pyproject.toml file. When I tried this I got an error in the "license compatibility check" job:

"File "/opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/tox/config/__init__.py", line 1965, in _replace_match
     raise tox.exception.ConfigError(
 tox.exception.ConfigError: ConfigError: No support for the 'url' substitution type"
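
For illustration, the kind of pyproject.toml entry I mean looks roughly like this, assuming Poetry-style dependency declarations; the wheel URL is only a placeholder for the CPU-only build of the pinned version:

```toml
[tool.poetry.dependencies]
# Placeholder URL: the exact CPU-only wheel for the pinned PyTorch version would go here.
torch = { url = "https://download.pytorch.org/whl/cpu/torch-2.0.1%2Bcpu-cp310-cp310-linux_x86_64.whl" }
```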

Universal Sentence Encoder (TensorFlow)
This is a model library from TensorFlow Hub. The problem here is that the installed TensorFlow version must match >2.14.0,<3.0.0, and TensorFlow requires protobuf (>=3.20.3,<4.21.0 || >4.21.0,<4.21.1 || >4.21.1,<4.21.2 || >4.21.2,<4.21.3 || >4.21.3,<4.21.4 || >4.21.4,<4.21.5 || >4.21.5,<5.0.0dev). But as the mech depends on protobuf (<=3.20.1,>=3.19), version solving failed.

Is it possible to add support for the 'url' substitution type to enable the use of PyTorch without CUDA? Or, alternatively, is it possible to extend the authorized licenses with NVIDIA's?
Or is there a chance to update protobuf to a version >=3.20.3 to make use of TensorFlow?

I think there is huge potential in preprocessing to select the most relevant information for making a reliable prediction. I would appreciate it if you could help me here. Maybe you have suggestions, too?

@dagacha

dagacha commented Oct 5, 2023

@0xArdi The license related to NVIDIA can be whitelisted, as it is released under Apache 2.0: https://github.com/NVIDIA/NVTX/blob/release-v3/LICENSE.txt

@jhehemann
Contributor Author

jhehemann commented Oct 6, 2023

@0xArdi
There are 12 components licensed by NVIDIA that do not pass the workflow:

  1. nvidia-cublas-cu12 (12.1.3.1)
  2. nvidia-cuda-cupti-cu12 (12.1.105)
  3. nvidia-cuda-nvrtc-cu12 (12.1.105)
  4. nvidia-cuda-runtime-cu12 (12.1.105)
  5. nvidia-cudnn-cu12 (8.9.2.26)
  6. nvidia-cufft-cu12 (11.0.2.54)
  7. nvidia-curand-cu12 (10.3.2.106)
  8. nvidia-cusolver-cu12 (11.4.5.107)
  9. nvidia-cusparse-cu12 (12.1.0.106)
  10. nvidia-nccl-cu12 (2.18.1)
  11. nvidia-nvjitlink-cu12 (12.2.140)
  12. nvidia-nvtx-cu12 (12.1.105)

It would be possible to define a PyTorch version in the pyproject.toml that does not automatically install the NVIDIA CUDA toolkit. For this I have to define the source URL of the version in the pyproject.toml (as in the sketch above). But when I do this, the workflow fails with the error:

 File "/opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/tox/config/__init__.py", line 1965, in _replace_match
    raise tox.exception.ConfigError(
tox.exception.ConfigError: ConfigError: No support for the 'url' substitution type

If support for the 'url' substitution type were added, I think there would be no need to whitelist the NVIDIA licenses, because the corresponding packages would then not be installed with PyTorch.

@0xArdi 0xArdi merged commit a4175d7 into valory-xyz:main Oct 13, 2023
4 of 6 checks passed
@0xArdi 0xArdi mentioned this pull request Oct 13, 2023
@0xarmagan

Hello, this is Armagan from Gnosis.

Congratulations on winning a prize for the Best Mech Tool in the Prediction Agent Hackathon! Your innovative contribution stood out, and we're excited to award your work.

Please reply to this comment with your Gnosis Chain wallet address so we can proceed with transferring your prize in xDAI.

Thank you for your fantastic work, and we look forward to your continued contributions!

@jhehemann
Contributor Author

Hi @0xarmagan, thank you for the great news! I'm honored and looking forward to future collaborations and contributions. My wallet address is 0x4a143CdC6B9A8AFb4D54E01A95A66c29ce8591DF.
