
Context retrieval only works for first user message #444

Open
1 of 2 tasks
wukaixingxp opened this issue Nov 13, 2024 · 2 comments
Labels
RAG Relates to RAG functionality of the agents API

Comments

@wukaixingxp
Contributor

wukaixingxp commented Nov 13, 2024

llama-stack installed from source: https://github.com/meta-llama/llama-stack/tree/cherrypick-working

System Info

python -m "torch.utils.collect_env"
/home/kaiwu/miniconda3/envs/llama/lib/python3.10/runpy.py:126: RuntimeWarning: 'torch.utils.collect_env' found in sys.modules after import of package 'torch.utils', but prior to execution of 'torch.utils.collect_env'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: CentOS Stream 9 (x86_64)
GCC version: (GCC) 11.5.0 20240719 (Red Hat 11.5.0-2)
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.34

Python version: 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.4.3-0_fbk14_zion_2601_gcd42476b84e9-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100
GPU 1: NVIDIA H100
GPU 2: NVIDIA H100
GPU 3: NVIDIA H100
GPU 4: NVIDIA H100
GPU 5: NVIDIA H100
GPU 6: NVIDIA H100
GPU 7: NVIDIA H100

Nvidia driver version: 535.154.05
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.9.2
/usr/lib64/libcudnn_adv_infer.so.8.9.2
/usr/lib64/libcudnn_adv_train.so.8.9.2
/usr/lib64/libcudnn_cnn_infer.so.8.9.2
/usr/lib64/libcudnn_cnn_train.so.8.9.2
/usr/lib64/libcudnn_ops_infer.so.8.9.2
/usr/lib64/libcudnn_ops_train.so.8.9.2
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 384
On-line CPU(s) list: 0-383
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9654 96-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU(s) scaling MHz: 82%
CPU max MHz: 3707.8120
CPU min MHz: 1500.0000
BogoMIPS: 4792.80
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization: AMD-V
L1d cache: 6 MiB (192 instances)
L1i cache: 6 MiB (192 instances)
L2 cache: 192 MiB (192 instances)
L3 cache: 768 MiB (24 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-95,192-287
NUMA node1 CPU(s): 96-191,288-383
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Vulnerable: eIBRS with unprivileged eBPF
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] onnx==1.16.2
[pip3] onnxruntime==1.19.2
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] triton==3.0.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

There are a Llama 3.1 model card and a Llama 3.2 model card in the database, and I asked:

user_prompts = [
        "What is the name of the llama model released on October 24, 2024?",
        "What about Llama 3.1 model, what is the release date for it?",
    ]

RAG only retrieved context (the Llama 3.2 model card) for the first message and did not perform retrieval for the second message; the context for the second message was still the Llama 3.2 model card from the first message. It would be great if context retrieval ran for every user message.

My code is here; use python rag_main.py localhost 5000 ./example_data/ to start this example.

Error logs

Inserted 3 documents into bank: rag_agent_docs
Created bank: rag_agent_docs
Found 2 models [ModelDefWithProvider(identifier='Llama3.2-11B-Vision-Instruct', llama_model='Llama3.2-11B-Vision-Instruct', metadata={}, provider_id='meta-reference', type='model'), ModelDefWithProvider(identifier='Llama-Guard-3-1B', llama_model='Llama-Guard-3-1B', metadata={}, provider_id='meta1', type='model')]
Use model: Llama3.2-11B-Vision-Instruct
Generating response for: What is the name of the llama model released on October 24, 2024?
messages [{'role': 'user', 'content': 'What is the name of the llama model released on October 24, 2024?'}]
----input_query------- What is the name of the llama model released on October 24, 2024?
Turn(input_messages=[UserMessage(content='What is the name of the llama model released on October 24, 2024?', role='user', context="Here are the retrieved documents for relevant context:\n=== START-RETRIEVED-CONTEXT ===\n\nid:llama_3.2.md; content:. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.\n\nLlama 3.2 Model Family: Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.\n\nModel Release Date: Oct 24, 2024\n\nStatus: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety.\n\nLicense: Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).\n\nFeedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models README. For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go here.\n\n## Intended Use\n\nIntended Use Cases: Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pretrained models can be adapted for a variety of additional natural language generation tasks. 
Similarly, quantized models can be adapted for a variety of on-device use-cases with limited compute resources.\n\nOut of Scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. Use in languages beyond those explicitly referenced as supported in this model card.\n\n## Hardware and Software\n\nTraining Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.\n\nTraining Energy Use\nid:llama_3.2.md; content:. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.\n\nLlama 3.2 Model Family:** Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.\n\nModel Release Date: Oct 24, 2024\n\nStatus: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety.\n\nLicense: Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).\n\nFeedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models README. For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go here.\n\n## Intended Use\n\nIntended Use Cases: Llama 3.2 is intended for commercial and research use in multiple languages. 
Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pretrained models can be adapted for a variety of additional natural language generation tasks. Similarly, quantized models can be adapted for a variety of on-device use-cases with limited compute resources.\n\nOut of Scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. Use in languages beyond those explicitly referenced as supported in this model card.\n\n## Hardware and Software\n\nTraining Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.\n\nTraining Energy Use\n\n=== END-RETRIEVED-CONTEXT ===\n")], output_attachments=[], output_message=CompletionMessage(content='The name of the llama model released on October 24, 2024, is not explicitly mentioned in the provided documents. 
However, the document mentions that the model is "Llama 3.2", but it does not indicate if "Llama 3.2" is the name of the specific model released on October 24, 2024, or if it is a version or variant of the model.\n\nIt does mention the Model Release Date as Oct 24, 2024, but this refers to the release of Llama 3.2, not the name of the specific model.\n\nTo answer your question accurately, I don't know the name of the llama model released on October 24, 2024, as this information is not explicitly mentioned in the provided documents.', role='assistant', stop_reason='end_of_turn', tool_calls=[]), session_id='de83a6c2-5643-42b0-9c89-01640439b524', started_at=datetime.datetime(2024, 11, 13, 9, 48, 44, 297982), steps=[MemoryRetrievalStep(inserted_context=['Here are the retrieved documents for relevant context:\n=== START-RETRIEVED-CONTEXT ===\n', "id:llama_3.2.md; content:. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.\n\nLlama 3.2 Model Family:** Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.\n\nModel Release Date: Oct 24, 2024\n\nStatus: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety.\n\nLicense: Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).\n\nFeedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models README. 
For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go here.\n\n## Intended Use\n\nIntended Use Cases: Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pretrained models can be adapted for a variety of additional natural language generation tasks. Similarly, quantized models can be adapted for a variety of on-device use-cases with limited compute resources.\n\nOut of Scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. Use in languages beyond those explicitly referenced as supported in this model card.\n\n## Hardware and Software\n\nTraining Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.\n\nTraining Energy Use", "id:llama_3.2.md; content:. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.\n\nLlama 3.2 Model Family:** Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.\n\nModel Release Date: Oct 24, 2024\n\nStatus: This is a static model trained on an offline dataset. 
Future versions may be released that improve model capabilities and safety.\n\nLicense: Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).\n\nFeedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models README. For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go here.\n\n## Intended Use\n\nIntended Use Cases: Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pretrained models can be adapted for a variety of additional natural language generation tasks. Similarly, quantized models can be adapted for a variety of on-device use-cases with limited compute resources.\n\nOut of Scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. Use in languages beyond those explicitly referenced as supported in this model card.\n\n## Hardware and Software\n\nTraining Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.\n\nTraining Energy Use", '\n=== END-RETRIEVED-CONTEXT ===\n'], memory_bank_ids=['rag_agent_docs'], step_id='d916a947-4dee-42e2-ac1a-410d54c7da3d', step_type='memory_retrieval', turn_id='4efeaab0-d7f1-495f-b653-3fd173a59db3', completed_at=None, started_at=None), InferenceStep(inference_model_response=CompletionMessage(content='The name of the llama model released on October 24, 2024, is not explicitly mentioned in the provided documents. 
However, the document mentions that the model is "Llama 3.2", but it does not indicate if "Llama 3.2" is the name of the specific model released on October 24, 2024, or if it is a version or variant of the model.\n\nIt does mention the Model Release Date as Oct 24, 2024, but this refers to the release of Llama 3.2, not the name of the specific model.\n\nTo answer your question accurately, I don't know the name of the llama model released on October 24, 2024, as this information is not explicitly mentioned in the provided documents.', role='assistant', stop_reason='end_of_turn', tool_calls=[]), step_id='603d12ab-f127-46de-9ccb-4e07bdccc7e3', step_type='inference', turn_id='4efeaab0-d7f1-495f-b653-3fd173a59db3', completed_at=None, started_at=None)], turn_id='4efeaab0-d7f1-495f-b653-3fd173a59db3', completed_at=datetime.datetime(2024, 11, 13, 9, 48, 50, 996089))
Generating response for: What about Llama 3.1 model, what is the release date for it?
messages [{'role': 'user', 'content': 'What about Llama 3.1 model, what is the release date for it?'}]
----input_query------- What about Llama 3.1 model, what is the release date for it?
Turn(input_messages=[UserMessage(content='What about Llama 3.1 model, what is the release date for it?', role='user', context="Here are the retrieved documents for relevant context:\n=== START-RETRIEVED-CONTEXT ===\n\nid:llama_3.2.md; content:. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.\n\n
Llama 3.2 Model Family:** Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.\n\nModel Release Date: Oct 24, 2024\n\nStatus: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety.\n\nLicense: Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).\n\nFeedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models README. For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go here.\n\n## Intended Use\n\nIntended Use Cases: Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pretrained models can be adapted for a variety of additional natural language generation tasks. Similarly, quantized models can be adapted for a variety of on-device use-cases with limited compute resources.\n\nOut of Scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. Use in languages beyond those explicitly referenced as supported in this model card.\n\n## Hardware and Software\n\nTraining Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.\n\nTraining Energy Use\nid:llama_3.2.md; content:. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. 
Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.\n\nLlama 3.2 Model Family:** Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.\n\nModel Release Date: Oct 24, 2024\n\nStatus: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety.\n\nLicense: Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).\n\nFeedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models README. For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go here.\n\n## Intended Use\n\nIntended Use Cases: Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pretrained models can be adapted for a variety of additional natural language generation tasks. Similarly, quantized models can be adapted for a variety of on-device use-cases with limited compute resources.\n\nOut of Scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. 
Use in languages beyond those explicitly referenced as supported in this model card.\n\n## Hardware and Software\n\nTraining Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.\n\nTraining Energy Use\n\n=== END-RETRIEVED-CONTEXT ===\n")], output_attachments=[], output_message=CompletionMessage(content="The release date for Llama 3.1 model is not mentioned in the provided documents. However, there is information about Llama 3.2 model's release date, which is October 24, 2024.\n\nIt appears that there is no information about the Llama 3.1 model in the provided documents.", role='assistant', stop_reason='end_of_turn', tool_calls=[]), session_id='de83a6c2-5643-42b0-9c89-01640439b524', started_at=datetime.datetime(2024, 11, 13, 9, 48, 51, 113170), steps=[MemoryRetrievalStep(inserted_context=['Here are the retrieved documents for relevant context:\n=== START-RETRIEVED-CONTEXT ===\n', "id:llama_3.2.md; content:. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.\n\nLlama 3.2 Model Family:** Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.\n\nModel Release Date: Oct 24, 2024\n\nStatus: This is a static model trained on an offline dataset. 
Future versions may be released that improve model capabilities and safety.\n\nLicense: Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).\n\nFeedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models README. For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go here.\n\n## Intended Use\n\nIntended Use Cases: Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pretrained models can be adapted for a variety of additional natural language generation tasks. Similarly, quantized models can be adapted for a variety of on-device use-cases with limited compute resources.\n\nOut of Scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. Use in languages beyond those explicitly referenced as supported in this model card.\n\n## Hardware and Software\n\nTraining Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.\n\nTraining Energy Use", "id:llama_3.2.md; content:. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. 
Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.\n\nLlama 3.2 Model Family:** Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.\n\nModel Release Date: Oct 24, 2024\n\nStatus: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety.\n\nLicense: Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).\n\nFeedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models README. For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go here.\n\n## Intended Use\n\nIntended Use Cases: Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pretrained models can be adapted for a variety of additional natural language generation tasks. Similarly, quantized models can be adapted for a variety of on-device use-cases with limited compute resources.\n\nOut of Scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. Use in languages beyond those explicitly referenced as supported in this model card.\n\n## Hardware and Software\n\nTraining Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. 
Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.\n\n**Training Energy Use", '\n=== END-RETRIEVED-CONTEXT ===\n'], memory_bank_ids=['rag_agent_docs'], step_id='e41a178b-182c-444c-8cb6-544979d75a17', step_type='memory_retrieval', turn_id='5b91a548-219f-4805-833f-5535b84abe29', completed_at=None, started_at=None), InferenceStep(inference_model_response=CompletionMessage(content="The release date for Llama 3.1 model is not mentioned in the provided documents. However, there is information about Llama 3.2 model's release date, which is October 24, 2024.\n\nIt appears that there is no information about the Llama 3.1 model in the provided documents.", role='assistant', stop_reason='end_of_turn', tool_calls=[]), step_id='dc72b93c-8f17-44e4-b50f-5f272b11327a', step_type='inference', turn_id='5b91a548-219f-4805-833f-5535b84abe29', completed_at=None, started_at=None)], turn_id='5b91a548-219f-4805-833f-5535b84abe29', completed_at=datetime.datetime(2024, 11, 13, 9, 48, 54, 441075))
The name of the llama model released on October 24, 2024, is not explicitly mentioned in the provided documents. However, the document mentions that the model is "Llama 3.2", but it does not indicate if "Llama 3.2" is the name of the specific model released on October 24, 2024, or if it is a version or variant of the model.

It does mention the Model Release Date as Oct 24, 2024, but this refers to the release of Llama 3.2, not the name of the specific model.

To answer your question accurately, I don't know the name of the llama model released on October 24, 2024, as this information is not explicitly mentioned in the provided documents.
The release date for Llama 3.1 model is not mentioned in the provided documents. However, there is information about Llama 3.2 model's release date, which is October 24, 2024.

It appears that there is no information about the Llama 3.1 model in the provided documents.

Expected behavior

It would be great to have context retrieval for every user message.
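
The expected behavior can be sketched as follows. This is a toy illustration, not llama-stack code: the in-memory "bank" and the keyword-overlap retrieve function are stand-ins for a real memory bank and vector search, but the key point is that retrieval is driven by the latest user message on every turn.

```python
# Toy in-memory "bank": doc id -> text. Keyword overlap stands in for
# vector search; all names here are hypothetical, not a llama-stack API.
BANK = {
    "llama_3.1.md": "Llama 3.1 ... Model Release Date: July 23, 2024.",
    "llama_3.2.md": "Llama 3.2 ... Model Release Date: Oct 24, 2024.",
}

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank documents by naive word overlap with the query."""
    q = set(query.lower().split())
    def score(doc_id: str) -> int:
        return len(q & set(BANK[doc_id].lower().split()))
    return sorted(BANK, key=score, reverse=True)[:k]

def context_for_turn(messages: list[dict]) -> list[str]:
    """Retrieve context using only the most recent user message."""
    latest = next(m["content"] for m in reversed(messages) if m["role"] == "user")
    return retrieve(latest)

history = [
    {"role": "user", "content": "What is the name of the llama model released on October 24, 2024?"},
    {"role": "assistant", "content": "Llama 3.2."},
    {"role": "user", "content": "What about Llama 3.1 model, what is the release date for it?"},
]
print(context_for_turn(history))  # retrieval keys off the second question
```

With per-turn retrieval like this, the second question would pull the Llama 3.1 model card instead of reusing the Llama 3.2 context from the first turn.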

@ashwinb ashwinb added the RAG Relates to RAG functionality of the agents API label Nov 13, 2024
@ashwinb
Contributor

ashwinb commented Nov 19, 2024

@dineshyv this is the RAG issue @init27 was mentioning earlier

@aidando73

aidando73 commented Nov 30, 2024

@wukaixingxp, @ashwinb I've just had a look at this.

RAG only retrieved context from the Llama 3.2 model card for the first message but did not do retrieval for the second message; the context was still the Llama 3.2 model card from the first message.

I've done a bit of testing, and the RAG query that is generated actually joins all the messages together.

In your case for messages:

user_prompts = [
    "What is the name of the llama model released on October 24, 2024?",
    "What about Llama 3.1 model, what is the release date for it?",
]

it generates:

query: You are a helpful assistant that can answer questions based on provided documents. Return your answer short and concise, less than 50 words. What is the name of the llama model released on October 24, 2024? What about Llama 3.1 model, what is the release date for it?
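The joining behavior can be reproduced with a minimal sketch. build_rag_query is a hypothetical helper (the actual llama-stack code path differs), but it shows why the second question never gets a retrieval query of its own: the system prompt and every user message are concatenated into one string.

```python
# Hypothetical sketch: every message in the turn is joined into a single
# retrieval query, so later questions piggyback on earlier ones.
def build_rag_query(system_prompt: str, user_messages: list[str]) -> str:
    """Concatenate the system prompt and all user messages with spaces."""
    return " ".join([system_prompt, *user_messages])

system_prompt = (
    "You are a helpful assistant that can answer questions based on provided "
    "documents. Return your answer short and concise, less than 50 words."
)
user_prompts = [
    "What is the name of the llama model released on October 24, 2024?",
    "What about Llama 3.1 model, what is the release date for it?",
]

query = build_rag_query(system_prompt, user_prompts)
print(query)
```

Because both questions end up in one query, whichever document dominates the combined embedding wins the search, which matches the inconsistent results described below.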

I added this print statement; my llama-stack-apps code is here.

In some cases, this works for me:

query:  You are a helpful assistant that can answer questions based on provided documents. Return your answer short and concise, less than 50 words. What is the name of the llama model released on October 24, 2024? What about Llama 3.1 model, what is the release date for it?
Batches: 100% 1/1 [00:00<00:00, 180.31it/s]
05:19:03.638 [ERROR] [/alpha/agents/turn/create.retrieve_rag_context] Using 3 chunks; reached max tokens in context: 400
05:19:03.649 [INFO] [/alpha/agents/turn/create] role='user' content='What about Llama 3.1 model, what is the release date for it?' context='Here are the retrieved documents for relevant context:\n=== START-RETRIEVED-CONTEXT ===\n\nid:llama_3.1.md; content:_1/LICENSE](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE)\n\nFeedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models [README](https://github.com/meta-llama/llama-models/blob/main/README.md). For more technica...<more>...id:llama_3.1.md; content: for improved inference scalability.\n\nModel Release Date: July 23, 2024.\n\nStatus: This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback.\n\nLicense: A custom commercial license, the Llama 3.1 Community License, is available at: [https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE](https://github.com/meta-\n\n=== END-RETRIEVED-CONTEXT ===\n'
05:19:05.048 [INFO] [/alpha/agents/turn/create] Assistant: According to the documents, Llama 3.1 model was released on July 23, 2024.

branched off of your branch here. Print statement in llama-stack here

iiuc, the problem here is that the search results are inconsistent or a bit poor.

I've run some of my own queries against the faiss index and they're a bit inconsistent:

Query: "Llama 3.2 3B Instruct"

Top 2 results are:

Index 8:
Content: _1/LICENSE](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE)

Feedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models [README](https://github.com/meta-llama/llama-models/blob/main/README.md). For more technical information about generation parameters and recipes for how to use Llama 3.1 in applications, please go [here](https://

Index 97:
Content: .com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) (a custom, commercial license agreement).

Feedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models [README](https://github.com/meta-llama/llama-models/blob/main/README.md). For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go [here](https

Query: "What are some small Llama models I can run on small devices like my phone?"

Index 175:
Content:  the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) (a custom, commercial license agreement).

Feedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models [README](https://github.com/meta-llama/llama-models/blob/main/README.md). For more technical information about generation parameters and recipes for how to use L

Index 8:
Content: _1/LICENSE](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE)

Feedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models [README](https://github.com/meta-llama/llama-models/blob/main/README.md). For more technical information about generation parameters and recipes for how to use Llama 3.1 in applications, please go [here](https://

Index 97:
Content: .com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) (a custom, commercial license agreement).

Feedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models [README](https://github.com/meta-llama/llama-models/blob/main/README.md). For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go [here](https

Index 152:
Content: B and 3B models are expected to be deployed in highly constrained environments, such as mobile devices. LLM Systems using smaller models will have a different alignment profile and safety/helpfulness tradeoff than more complex, larger systems. Developers should ensure the safety of their system meets the requirements of their use case. We recommend using lighter system safeguards for such use cases, like Llama Guard 3-1B or its mobile-optimized version.

(Last result is relevant but the first 3 aren't that useful)

source
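Notably, several retrieved chunks begin mid-URL (e.g. `_1/LICENSE](...`), which suggests fixed-size character chunking. A boundary-aware scheme is one possible mitigation; here is a minimal sketch under that assumption (`chunk_text` is a hypothetical helper, not the scheme llama-stack actually uses):

```python
# A sketch of paragraph-aware chunking with overlap: split on blank lines,
# pack whole paragraphs into chunks up to a size budget, and carry the last
# `overlap` paragraphs into the next chunk for continuity. Chunks then
# never start mid-sentence or mid-URL.

def chunk_text(text: str, max_chars: int = 300, overlap: int = 1) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, size = [], [], 0
    for para in paragraphs:
        if current and size + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:]  # keep tail paragraphs as overlap
            size = sum(len(p) for p in current)
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = ("Model Release Date: July 23, 2024.\n\n"
       "Status: This is a static model trained on an offline dataset.\n\n"
       "License: A custom commercial license, the Llama 3.1 Community "
       "License, is available on GitHub.")
chunks = chunk_text(doc, max_chars=120)
for c in chunks:
    print("---\n" + c)
```

Each paragraph lands in a chunk intact, at the cost of some duplication from the overlap.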

If I have a bit of time I might see how we could improve them. Maybe adding keyword search [1], trying different/bigger embedding models [2], or different chunking schemes [3] would help here?
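To illustrate the keyword-search idea [1], here is a minimal hybrid-retrieval sketch that blends a dense-similarity score with a keyword-overlap score. The dense score is stubbed with a bag-of-words cosine; in a real deployment it would come from the FAISS index. All function names here are illustrative assumptions.

```python
# Minimal hybrid retrieval sketch: rank documents by a weighted blend of
# a (stubbed) dense similarity and a keyword-overlap score, so exact terms
# like "mobile" or "3B" can rescue results the embedding alone misses.
import math
import re
from collections import Counter

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def bow_cosine(a: str, b: str) -> float:
    # Stand-in for a dense embedding similarity.
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, doc: str) -> float:
    q, d = set(tokens(query)), set(tokens(doc))
    return len(q & d) / len(q) if q else 0.0

def hybrid_rank(query: str, docs: list[str], alpha: float = 0.5) -> list[str]:
    # alpha blends the dense and keyword signals; tune per corpus.
    scored = [(alpha * bow_cosine(query, d)
               + (1 - alpha) * keyword_score(query, d), d) for d in docs]
    return [d for _, d in sorted(scored, reverse=True)]

docs = [
    "Feedback: instructions on how to provide feedback on the model.",
    "The 1B and 3B models are expected to be deployed on mobile devices.",
]
ranked = hybrid_rank("small Llama models for phone mobile devices", docs)
print(ranked[0])
```

On the phone-sized-models query from above, the keyword signal pushes the mobile-deployment chunk ahead of the boilerplate "Feedback" chunk, which is exactly the failure mode seen in the faiss results.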
