Releases: furiosa-ai/inference-compression

MLPerf4.1-v3.12.1

05 Jul 14:38
42cd3ce

What's Changed

Full Changelog: MLPerf4.1-v3.12...MLPerf4.1-v3.12.1

MLPerf4.1-llama-v3.12.1

05 Jul 14:38
6b8b642

What's Changed

Full Changelog: MLPerf4.1-llama-v3.11...MLPerf4.1-llama-v3.12.1

MLPerf4.1-v3.12

02 Jul 12:49
3f6a1c9

What's Changed

Full Changelog: MLPerf4.1-v3.11...MLPerf4.1-v3.12

MLPerf4.1-llama-v3.11

01 Jul 04:21
c12fac0

MLPerf4.1-v3.11

27 Jun 06:36
b7df4c5

What's Changed

New Contributors

Full Changelog: MLPerf4.1-v3.8...MLPerf4.1-v3.11

MLPerf4.1-v3.8

07 Jun 09:37
a2cf147

What's Changed

Full Changelog: MLPerf4.1-v3.5...MLPerf4.1-v3.8

MLPerf4.1-v3.5

08 May 08:01
30c6fc9

What's Changed

  • Applied MCP changes: when create_quant_sim runs for decode_graph, quantized_prefill_graph is now passed as an input (see the sketch below)
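A minimal sketch of what this change implies for callers, assuming a create_quant_sim entry point along the lines of the note above (the real model-compressor signature is not shown in these notes):

```python
# Hypothetical sketch; create_quant_sim and its keyword argument are assumed
# from the note above, not taken from the model-compressor API.
quantized_prefill_graph = create_quant_sim(prefill_graph)

# The decode graph's quant-sim now receives the already-quantized prefill
# graph as an input, so the two graphs share quantization state.
quantized_decode_graph = create_quant_sim(
    decode_graph,
    quantized_prefill_graph=quantized_prefill_graph,
)
```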

MLPerf4.1-v3.4

29 Apr 08:28
  • Ported QuantPagedAttentionGenerator for paged_attention_rope
  • Generation with paged_attention_rope.GPTJForCausalLM is now possible by setting the 'model_source' argument of language/gpt-j/main.py to 'paged_attention_rope' (example invocation below)
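For example, a run might look like this (flag spelling assumed from the note above; all other arguments omitted):

```
python language/gpt-j/main.py --model_source paged_attention_rope
```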

MLPerf4.1-v3.1

09 Apr 09:03

What's Changed

  • Added GPT-J preallocated (preallocated_concat_rope.py) and applied the paged_attention_concat_rope updates
    • paged_attention_concat_rope.py
      • Verified Qlevel 4 fx graph extraction through generation (greedy_search)
      • Defined QuantPagedAttentionGenerator in furiosa-llm-repo
        • An implementation that runs furiosa-llm.PagedAttentionGenerator on an fx graph rather than a torch model (see the sketch after this list)
    • preallocated_concat_rope.py
      • Completed Qlevel 4 fx graph conversion
      • Completed the QuantPreallocatedGenerator implementation
        • Serves the same role as QuantPagedAttentionGenerator
    • Future work: generation with the modified rope
  • Model-compressor-private: fixed QLV4EmbeddingMOD, which hit a graph break under torch.dynamo.export (issue confirmed and applied by Seonghwan)
  • Furiosa-llm-models: removed the dtype cast to resolve the f32-to-i32 conversion arriving as a Cast op instead of FxToFxp (issue raised on Slack)
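As referenced in the list above, the idea behind QuantPagedAttentionGenerator is that a traced torch.fx GraphModule is callable like any module, so a greedy-search loop can drive the quantized graph directly. A minimal sketch, with illustrative names only (the real generator and its paged KV-cache handling live in furiosa-llm-repo):

```python
import torch

def greedy_generate(fx_graph: torch.fx.GraphModule,
                    input_ids: torch.Tensor,
                    max_new_tokens: int) -> torch.Tensor:
    # Drive the quantized fx graph exactly as a torch model would be driven;
    # the graph is assumed to map input_ids -> logits here.
    for _ in range(max_new_tokens):
        logits = fx_graph(input_ids)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids
```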

Full Changelog: MLPerf4.1-v2.1...MLPerf4.1-v3.1

MLPerf4.1-v2.1

02 Apr 17:40
  • Performance
    • BERT: exact_match 83.9546 (100.32%), f1 91.05177 (100.19%) [W8A8KV8, calibrated and evaluated on A100]
    • GPT-J original: rouge1 43.0717 (100.20%), rouge2 20.1398 (100.08%), rougeL 30.0108 (100.08%), gen_len 3984079 (99.18%) [W8A8KV8 + SmoothQuant, calibrated and evaluated on H100]
  • Updates
    • Changed zero-point shapes from per-head to per-tensor, affecting the matmul operation with zero-point equalizing
    • Added FP32 emulation in / emulation out for the bf16 x bf16 dot product (sketches for both changes follow this list)
    • Ported the GPTJ-paged_attention_concat_rope model
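Two of the updates above lend themselves to small sketches: the shape change for per-tensor zero points in a dequantize-then-matmul path, and FP32 emulation around a bf16 x bf16 dot product. All names are illustrative, not the model-compressor API:

```python
import torch

# (1) Per-tensor zero point: zp_a / zp_b are 0-dim scalars rather than
#     per-head vectors, so the subtraction broadcasts identically over heads.
def dequant_matmul(q_a, q_b, scale_a, scale_b, zp_a, zp_b):
    a = (q_a.float() - zp_a) * scale_a   # dequantize lhs
    b = (q_b.float() - zp_b) * scale_b   # dequantize rhs
    return a @ b

# (2) bf16 x bf16 dot product with "emulation in / emulation out" to FP32:
#     cast the operands up, accumulate in fp32, cast the result back down.
def bf16_dot_fp32_emulated(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    out32 = a.to(torch.float32) @ b.to(torch.float32)
    return out32.to(torch.bfloat16)
```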