Intel® Neural Speed v1.0a Release
Highlights
- Improve performance on client CPUs
- Support batching and submit GPT-J results to MLPerf v4.0
Improvements
- Support continuous batching and beam search inference (7c2199)
- Improve performance on AVX2 platforms (bc5ee16, aa4a8a, 35c6d10)
- Support FFN fusion for ChatGLM2 (96fadd)
- Enable loading models from ModelScope (ad3d19)
- Extend the supported input token length (eb41b9, e76a58e)
- [BesTLA] Improve RTN quantization accuracy for int4 and int3 (a90aea); a usage sketch follows this list
- [BesTLA] New thread pool and hybrid dispatcher (fd19a44)
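For context, here is a minimal sketch of the int4 inference path these items touch, based on the project's README-style `Model` API. The model name is only an example, and the `model_hub="modelscope"` argument and the `num_beams` generate kwarg are assumptions about this release rather than confirmed signatures.

```python
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

# Example model; any causal LM supported by Neural Speed works here.
model_name = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = Model()
# int4 RTN weight quantization with int8 compute; the accuracy of this path
# is what the BesTLA item above improves. Passing model_hub="modelscope"
# (assumed kwarg) would load from ModelScope instead of the Hugging Face Hub.
model.init(model_name, weight_dtype="int4", compute_dtype="int8")

# Beam search inference is supported in this release; num_beams is assumed
# to be forwarded like the Hugging Face generate kwarg of the same name.
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=128, num_beams=4)
```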
Examples
- Enable Mixtral 8x7B (9bcb612)
- Enable Mistral-GPTQ (96dc55)
- Implement the YaRN RoPE scaling feature (6c36f54)
- Enable Qwen1.5 (750b35); a loading sketch follows this list
- Support GPTQ & AWQ inference for Qwen v1, v1.5, and Mixtral-8x7B (a129213)
- Support GPTQ for Baichuan2-13B, Falcon-7B, and Phi-1.5 (eed9b3)
- Enable Baichuan-7B and refactor Baichuan-13B (8d5fe2d)
- Enable StableLM2-1.6B, StableLM2-Zephyr-1.6B, and StableLM-3B (872876)
- Enable ChatGLM3 (94e74d)
- Enable Gemma-2B (e4c5f71)
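As a rough usage sketch for the newly enabled models, the snippet below loads Qwen1.5 through the same `Model` API shown above; the other new architectures (Gemma-2B, ChatGLM3, the StableLM2 variants) are assumed to load the same way.

```python
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

# Qwen1.5 is newly enabled in this release; swap in e.g. "google/gemma-2b"
# or a ChatGLM3/StableLM2 checkpoint for the other new architectures.
model_name = "Qwen/Qwen1.5-7B-Chat"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Hello, my name is", return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = Model()
model.init(model_name, weight_dtype="int4", compute_dtype="int8")
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=64)
```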
Bug Fixing
- Fix a convert_quantized model bug (37d01f3)
- Fix an AutoRound accuracy regression (991c35)
- Fix a Qwen loading error (2309fbb)
- Fix a GGUF conversion issue (5293ffa)
Validated Configurations
- Python 3.9, 3.10, 3.11
- Ubuntu 22.04