
Add AI documentation #491

Merged: 12 commits into JabRef:main from InAnYan:ai-1, Aug 12, 2024

Conversation

@InAnYan (author) commented Jun 27, 2024:

No description provided.

@koppor (Member) left a comment:

Good start. The general structure is too technical and should be made more user-oriented. In particular, the contents of the proposed blog entry are missing.

The documentation should also include the link https://platform.openai.com/playground/chat?models=gpt-4o and some explanatory text (noting that one can play around with the parameters in a minimal setting).


## Chat model

**Type**: enumeration
A Member commented:

I think, this is a computer-science term. "List" is more user-friendly, isn't it?


**Type**: enumeration

**Requirements**: choose one available from combo box
A Member commented:

This is trivial. Just remove it?


**Requirements**: choose one available from combo box

The chat model specifies which AI models you can use. The available models differ from one provider to another. Models vary in their accuracy, knowledge of the world, and context window (the amount of information they can process).
A Member commented:

This should go first, because this is kind of an introduction to the field.

@InAnYan (author) replied:

You propose moving the description up, and moving "type" and "requirements" (except for this one) lower?

A Member commented:

Yes. Both in documentation and in scientific papers, one starts with some intro text and then goes into details. I think, the type is the least important information to know when one wants to know what "Chat model" means.


Different embedding models have different performance: this includes accuracy and how fast embeddings can be computed. `Q` at the end of a model name usually means *quantized* (meaning *reduced*, *simplified*). These models are faster, but less accurate.

Currently only local embedding models are supported. That means you don't have to provide a new API key and all the logic will be run on your machine.
A Member commented:

First occurrence of "API key". It should be explained where to get one.

@@ -0,0 +1,79 @@
# AI functionality in JabRef

A Member commented:

Here, much text is missing. I think, you can just copy and paste from the blog post.

Reason: We try to collect (and update) documentation at docs.jabref.org, not at blog.jabref.org. blog.jabref.org is "only" for advertising new features, not for providing deep explanations. User documentation should be self-contained.

@InAnYan (author) replied:

What text is missing there?

And what technical details did I provide in the blog post? I think most of the blog post consists of 1) showing new features, 2) a tutorial (a crappy one) on how to get an OpenAI API key.

A Member commented:

Sections from https://github.com/InAnYan/blog.jabref.org/blob/ai-1/_posts/2024-07-01-AI-chatting.md

  • AI chat tab
  • How does this work?
  • How to get an OpenAI API key?
  • AI preferences (at least the screenshot and maybe some intro text)


**Requirements**: > 0

The "message window size" in a Large Language Model (LLM) refers to the number of recent messages or interactions that the model remembers during a conversation. This parameter determines how much context the LLM considers when generating responses.
@ThiloteE (Member) commented Jul 6, 2024:

See https://docs.gpt4all.io/gpt4all_desktop/settings.html#sampling-settings for a different explanation of the settings:
[screenshot of the GPT4All sampling-settings documentation]
That documentation is not completely correct either. The context window is measured in tokens. More specifically, it is the sum of the tokens in the system prompt + the tokens in the user prompt + the tokens in the chunks/snippets of embeddings that were added to the user prompt + the tokens in the responses by the model.

In-app documentation of GPT4All:
[screenshot]
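
To illustrate this point, a minimal sketch with made-up numbers, showing how the context window is a shared token budget rather than a message count (all figures here are hypothetical):

```java
public class ContextBudget {
    public static void main(String[] args) {
        int contextWindow = 8192;            // e.g., a hypothetical 8k-token model

        int systemPromptTokens   = 200;      // tokens in the system prompt
        int userPromptTokens     = 150;      // tokens in the user's message
        int retrievedChunkTokens = 4 * 300;  // four ~300-token embedding chunks
        int reservedForResponse  = 1024;     // budget kept free for the model's answer

        // Everything the model sees (and says) draws from the same budget.
        int used = systemPromptTokens + userPromptTokens
                 + retrievedChunkTokens + reservedForResponse;

        System.out.printf("Used %d of %d tokens (%d left)%n",
                used, contextWindow, contextWindow - used);
    }
}
```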

@ThiloteE (Member) commented Jul 6, 2024:

Actually, this is the best explanation of model settings: https://artefact2.github.io/llm-sampling/index.xhtml

@InAnYan (author) replied:

artefact2.github.io/llm-sampling/index.xhtml might be a good explanation, but I think it's too technical and mathematically inclined. And I haven't included those parameters in JabRef (except temperature).

A Member commented:

"Message window size" is the same as "context length" in GPT4All, as far as I understand it and this paragraph should be rewritten. The description of Message window size is not accurate.

A Member commented:

Or is it really the number of messages?

@InAnYan (author) replied:

In langchain4j it's really the number of messages. It handles the context length of the chat history somehow on its own.

@ThiloteE (Member) commented Jul 8, 2024:

But what if one user message has 50 tokens and another has 5000 tokens? Will both be counted as one? Models have a maximum context window size they are trained on, and that limit is counted in tokens. Model responses degrade steeply when going above that limit, regardless of whether the cutoff falls in the middle of a sentence. It does not make sense to count by number of messages, as each message can contain a different number of tokens.

@InAnYan (author) replied:

The algorithms for managing chat messages in langchain4j are very messy, so I decided not to touch them at all.

I've double-checked the algorithm that langchain4j uses in MessageWindowChatMemory (a sketch of how it is typically configured follows below):

  1. It tries to estimate the size of the chat history
  2. When it overflows, it removes several old messages to conform to the context size requirements
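
For reference, a minimal sketch of how langchain4j's MessageWindowChatMemory is typically wired up (the limit of 10 and the message texts are illustrative, not JabRef's actual configuration):

```java
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.memory.ChatMemory;
import dev.langchain4j.memory.chat.MessageWindowChatMemory;

public class MemorySketch {
    public static void main(String[] args) {
        // Keep at most the 10 most recent messages; older ones are evicted.
        ChatMemory memory = MessageWindowChatMemory.withMaxMessages(10);

        memory.add(UserMessage.from("What is BibTeX?"));
        memory.add(AiMessage.from("BibTeX is a reference management format."));

        // Everything still inside the window is sent to the model as context.
        System.out.println(memory.messages());
    }
}
```

Note that the eviction count here is messages, not tokens; langchain4j also offers TokenWindowChatMemory for token-based eviction.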

@ThiloteE (Member) commented Jul 9, 2024:

Ok, "Message Window Size" is derived from LangChain-specific code related to MessageWindowChatMemory. I tried finding "Message Window Size" in their documentation, but could not. What I found was https://docs.langchain4j.dev/tutorials/chat-memory/. We have basically made up our own name for this behaviour. That is fine, but we should take care not to confuse it with the "Context Window Size" of an LLM. The two are not the same. As I understand it, our "Message Window Size" will remove messages only if it overflows, regardless of whether users have already exceeded the model's context window size long ago. Please correct me if I am wrong.

This deserves a tiny rewrite, and I will resolve this conversation as soon as we have disentangled this issue.

@InAnYan (author) commented Jul 6, 2024:

Hmm, you say that the general structure is too technical. You mean the blog post or documentation?

In either case, I can't really understand what is technical and what is not. Like, how can I simplify it further?

I don't really like the idea of describing parameters in one sentence, as:

  1. It leaves the users with more questions than answers
  2. It doesn't provide enough information on how a parameter works, why we need it, and what the consequences of changing it are (the consequences are probably more important)

@InAnYan (author) commented Jul 6, 2024:

Oh, and also: why is AI not "advanced"?

  • It's the thing that explains the internals of the AI functionality (similar to en\advanced\fields.md, which explains the different BibTeX fields and more).
  • It's the thing that you usually don't touch (most of the time a user spends in the GUI, not using the command line, en/advanced/commandline.md).

@koppor (Member) commented Jul 8, 2024:

> Hmm, you say that the general structure is too technical. You mean the blog post or documentation?

I don't know who "you" is here. I meant the user documentation: this pull request here.

The blog post is discussed at https://github.com/InAnYan/blog.jabref.org/blob/ai-1/_posts/2024-07-01-AI-chatting.md

@koppor (Member) commented Jul 8, 2024:

> Oh, and also: why is AI not "advanced"?

I didn't find the text I wrote, therefore I rephrase:

  1. I assume a user of the documentation DOES NOT follow the blog posts
  2. I assume documentation is updated
  3. I assume the AI feature will be in JabRef in 2030
  4. I assume a reader does not read blog posts of 2024 in 2030 to understand the main idea

Thus, the documentation should introduce the AI feature.

The AI feature is a prominent tab in the entry editor. It appears on first start. Thus, it should be explained "easily".

The AI feature will be an important, well-working feature of JabRef. Thus, it should be explained.

I agree that some special settings of OpenAI might be detailed. I think, they should be in the second part of the AI documentation, but the first part should be some intro text.

> * It's the thing that explains the internals of the AI functionality (similar to [en\advanced\fields.md](https://github.com/InAnYan/jabref-user-documentation/blob/ai-1/en/advanced/fields.md?rgh-link-date=2024-07-06T18%3A16%3A47Z), which explains the different BibTeX fields and more).

I think, I also wrote that we as a team should move that somewhere else. I was thinking of https://www.bibtex.org/ even. "Advanced" is the wrong position for it. Users should really be aware of the meanings of the fields. However, the current explanation is not made for normal users (e.g., not a grouped explanation, but a random-appearing collection).

> * It's the thing that you usually don't touch (most of the time a user spends in the GUI,

I repeat: the introduction text for the AI and some explanations of the inner workings have to be in the user documentation.

I add: I agree that detailed configurations are not meant for the average user. I think, the AI documentation can be on one page and not on two pages (intro in root and settings in advanced). Reason: I think people can scroll and know when to stop reading. We need to write the page from high level (intro text) to lower level (details).

@InAnYan (author) commented Jul 8, 2024:

@koppor I had an idea, but was more focused on fixing bugs in the ai-pr-1 pull request.

What about this:

  1. The parameters "Enable chatting with attached PDF files" and "OpenAI token" are outside of the Expert settings. What if we put the explanation of these parameters inside ai/ai.md? Under "Enable chatting with attached PDF files" we would put the introduction text to the AI features.
  2. Every other parameter is inside the Expert settings section. What if we put the documentation for those parameters inside ai/expert-settings.md?

@koppor (Member) commented Jul 9, 2024:

> What about this:

I miss the "why" there. AKA "reasoning" AKA providing rationales.

My reasoning for one page was:

> I agree that detailed configurations are not meant for the average user. I think, the AI documentation can be on one page and not on two pages (intro in root and settings in advanced). Reason: I think people can scroll and know when to stop reading. We need to write the page from high level (intro text) to lower level (details).

On which point do you have another opinion? Why do you think splitting things up is easier?

There is one preference page. Thus, there should also be one page explaining each element.

Note that GitBook offers a side pane.

[screenshot of the GitBook side pane]


See https://medium.com/olzzio/y-statements-10eb07b5a177 for more motivation on why one should reason about options.

@koppor (Member) commented Jul 9, 2024:

Please also take the time to fix the markdown lint issues:

Error: en/advanced/ai.md:23:151 MD009/no-trailing-spaces Trailing spaces [Expected: 0 or 2; Actual: 1]
Error: en/advanced/ai.md:79:349 MD047/single-trailing-newline Files should end with a single newline character

You can see them in the GitHub output when clicking on "Details".

You should install markdownlint as a VS Code plugin. Long text: to have automatic editor support, this PR adds markdownlint as a recommended plugin for VS Code.


The "chunk overlap" parameter determines how much text from adjacent chunks is shared when dividing linked files into segments. This overlap is measured in characters and ensures continuity and context across segmented chunks. By sharing a specified amount of text between adjacent segments, typically at the beginning and/or end of each chunk, the AI model can maintain coherence and understanding of the content across segmented parts. This approach helps enhance the accuracy and relevance of responses generated by the AI from the segmented content.

## Retrieval augmented generation maximum results count
A Member commented:

"Maximum results count" sounds complicated. I prefer naming this "Maximum number of chunks" or "Maximum number of snippets".


Currently, only local embedding models are supported. That means, you don't have to provide a new API key and all the logic will be run on your machine.

## Instruction
A Member commented:

System Message is the standard term used in the industry. I have tried multiple AI chat programs and they all use System Message. Is there a particular reason we deviate from the standard?

@koppor (Member) commented Aug 6, 2024:

Please link the privacy policy of JabRef where the AI providers will be listed, too.


An AI provider is a company or a service that gives you the ability to send requests to and receive responses from an LLM. To get a response, you also need to send an API key, which authenticates you and is used to manage billing.

Here is the list of AI providers we currently support: OpenAI, Mistral AI, Hugging Face. Others include Google Vertex AI, Microsoft Azure OpenAI, Anthropic, etc.
A Member commented:

What does "etc." mean? Either the list is complete or incomplete. If it is incomplete, state where one could get the information about the complete list.

Maybe, they can be distinguished between checked and possibly working. --> Think about the Privacy policy. We need to list all providers which JabRef directly supports. For others, users need to check the privacy policy for themselves.

@InAnYan (author) replied:

I provided a link to this page: https://docs.langchain4j.dev/category/language-models

What do you think, is that okay?
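
For context, a minimal sketch of how a langchain4j client authenticates against a provider such as OpenAI with an API key (the model name, temperature, and environment variable are illustrative; this is not JabRef's actual code):

```java
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.openai.OpenAiChatModel;

public class ProviderSketch {
    public static void main(String[] args) {
        // The API key authenticates you with the provider and is used for billing.
        ChatLanguageModel model = OpenAiChatModel.builder()
                .apiKey(System.getenv("OPENAI_API_KEY")) // never hard-code keys
                .modelName("gpt-4o")                     // the chat model to use
                .temperature(0.7)                        // sampling temperature
                .build();

        String answer = model.generate("Summarize what BibTeX is in one sentence.");
        System.out.println(answer);
    }
}
```

Swapping the provider (Mistral AI, Hugging Face, etc.) means swapping the model class and the key, while the rest of the call stays the same; that is the abstraction langchain4j provides.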

A further review thread on en/ai/ai-providers-and-api-keys.md was marked resolved.
@koppor merged commit 484f6cd into JabRef:main on Aug 12, 2024. 3 checks passed.
@koppor deleted the ai-1 branch on August 12, 2024 at 20:42.
@InAnYan (author) commented Aug 12, 2024:

😳

@koppor mentioned this pull request on Aug 13, 2024.