
Intel® Neural Speed v1.0a Release

Released by @kevinintel on 22 Mar 11:10

Highlights
Improvements
Examples
Bug Fixing
Validated Configurations

Highlights

  • Improve performance on client CPUs
  • Support batching and submit GPT-J results to MLPerf v4.0

Improvements

  • Support continuous batching and beam search inference (7c2199); see the sketch after this list
  • Improve performance on the AVX2 platform (bc5ee16, aa4a8a, 35c6d10)
  • Support FFN fusion for ChatGLM2 (96fadd)
  • Enable loading models from ModelScope (ad3d19); see the sketch after this list
  • Extend the supported input token length (eb41b9, e76a58e)
  • [BesTLA] Improve RTN quantization accuracy for int4 and int3 (a90aea)
  • [BesTLA] New thread pool and hybrid dispatcher (fd19a44)
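
The two items above that touch the Python API (beam search generation and ModelScope loading) can be exercised together. The following is a minimal sketch built on the Model.init()/generate() pattern from the project README; the model_hub="modelscope" and num_beams arguments mirror the features listed above, but the exact parameter names and the model ID are assumptions, not a definitive interface.

```python
# Hedged sketch: 4-bit inference with Neural Speed, combining the
# ModelScope-loading and beam-search features from this release.
# Assumptions: the model_hub and num_beams parameter names, and the
# model ID below, are illustrative and may differ from the actual API.
from transformers import AutoTokenizer
from neural_speed import Model

model_name = "qwen/Qwen-7B-Chat"  # illustrative ModelScope model ID
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids

model = Model()
# model_hub="modelscope" fetches weights from ModelScope instead of Hugging Face
model.init(model_name, weight_dtype="int4", compute_dtype="int8",
           model_hub="modelscope")
# num_beams > 1 exercises the beam-search decoding path added in this release
outputs = model.generate(inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```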

Examples

  • Enable Mixtral 8x7B (9bcb612)
  • Enable Mistral-GPTQ (96dc55)
  • Implement the YaRN RoPE scaling feature (6c36f54)
  • Enable Qwen 1.5 (750b35)
  • Support GPTQ & AWQ inference for Qwen v1, v1.5 and Mixtral-8x7B (a129213); see the sketch after this list
  • Support GPTQ for Baichuan2-13B & Falcon 7B & Phi-1.5 (eed9b3)
  • Enable Baichuan-7B and refactor Baichuan-13B (8d5fe2d)
  • Enable StableLM2-1.6B & StableLM2-Zephyr-1.6B & StableLM-3B (872876)
  • Enable ChatGLM3 (94e74d)
  • Enable Gemma-2B (e4c5f71)
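
Running one of the pre-quantized checkpoints above follows the same pattern. The sketch below assumes a use_gptq flag on Model.init(), consistent with the GPTQ inference support listed in this release; the flag name and the checkpoint ID are assumptions, not a confirmed API.

```python
# Hedged sketch: inference on a pre-quantized GPTQ checkpoint.
# Assumptions: the use_gptq flag and the model ID below are illustrative
# of the GPTQ support in this release, not a confirmed interface.
from transformers import AutoTokenizer
from neural_speed import Model

model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"  # illustrative GPTQ checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("What is continuous batching?", return_tensors="pt").input_ids

model = Model()
model.init(model_name, use_gptq=True)  # load GPTQ weights directly, no re-quantization
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```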

Bug Fixing

Validated Configurations

  • Python 3.9, 3.10, 3.11
  • Ubuntu 22.04