Intel® Neural Speed v1.0a Release
Highlights
- Improve performance on client CPUs
- Support batching and submit GPT-J results to MLPerf v4.0
Improvements
- Support continuous batching and beam search inference (7c2199)
- Improve performance on AVX2 platforms (bc5ee16, aa4a8a, 35c6d10)
- Support FFN fusion for ChatGLM2 (96fadd)
- Enable loading models from ModelScope (ad3d19)
- Extend the supported input token length (eb41b9, e76a58e)
- [BesTLA] Improve RTN quantization accuracy for int4 and int3 (a90aea); a usage sketch follows this list
- [BesTLA] New thread pool and hybrid dispatcher (fd19a44)
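For context, here is a minimal sketch of the int4 inference path these items touch, based on the project's README-style `Model` API. The model name is only an example, and the `model_hub="modelscope"` argument and the `num_beams` generate kwarg are assumptions about this release rather than confirmed signatures.

```python
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

# Example model; any causal LM supported by Neural Speed works here.
model_name = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = Model()
# int4 RTN weight quantization with int8 compute; the accuracy of this path
# is what the BesTLA item above improves. Passing model_hub="modelscope"
# (assumed kwarg) would load from ModelScope instead of the Hugging Face Hub.
model.init(model_name, weight_dtype="int4", compute_dtype="int8")

# Beam search inference is supported in this release; num_beams is assumed
# to be forwarded like the Hugging Face generate kwarg of the same name.
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=128, num_beams=4)
```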
Examples
- Enable Mixtral 8x7B (9bcb612)
- Enable Mistral-GPTQ (96dc55)
- Implement the YaRN RoPE scaling feature (6c36f54)
- Enable Qwen1.5 (750b35); a loading sketch follows this list
- Support GPTQ & AWQ inference for Qwen v1, v1.5, and Mixtral-8x7B (a129213)
- Support GPTQ for Baichuan2-13B, Falcon-7B, and Phi-1.5 (eed9b3)
- Enable Baichuan-7B and refactor Baichuan-13B (8d5fe2d)
- Enable StableLM2-1.6B, StableLM2-Zephyr-1.6B, and StableLM-3B (872876)
- Enable ChatGLM3 (94e74d)
- Enable Gemma-2B (e4c5f71)
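As a rough usage sketch for the newly enabled models, the snippet below loads Qwen1.5 through the same `Model` API shown above; the other new architectures (Gemma-2B, ChatGLM3, the StableLM2 variants) are assumed to load the same way.

```python
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

# Qwen1.5 is newly enabled in this release; swap in e.g. "google/gemma-2b"
# or a ChatGLM3/StableLM2 checkpoint for the other new architectures.
model_name = "Qwen/Qwen1.5-7B-Chat"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Hello, my name is", return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = Model()
model.init(model_name, weight_dtype="int4", compute_dtype="int8")
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=64)
```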
Bug Fixing
- Fix a convert_quantized model bug (37d01f3)
- Fix an AutoRound accuracy regression (991c35)
- Fix a Qwen loading error (2309fbb)
- Fix a GGUF conversion issue (5293ffa)
Validated Configurations
- Python 3.9, 3.10, 3.11
- Ubuntu 22.04