Releases · furiosa-ai/inference-compression
MLPerf4.1-v3.12.1
What's Changed
- Fix pg dataloader by @sunghyuckhong in #74
- Generalize ci by @sunghyuckhong in #67
- gptj qparam migration code by @jeongin-yun in #78
- use mcp generator by @jeongin-yun in #82
- Save qlv4 by @Mincho0102 in #80
Full Changelog: MLPerf4.1-v3.12...MLPerf4.1-v3.12.1
MLPerf4.1-llama-v3.12.1
What's Changed
- split evaluation by @BeomGeunCho in #72
- Qlv4 save by @Mincho0102 in #77
- add args and statedict by @Mincho0102 in #81
- Port mcp generator & generalize ci for llama by @sunghyuckhong in #83
Full Changelog: MLPerf4.1-llama-v3.11...MLPerf4.1-llama-v3.12.1
MLPerf4.1-v3.12
What's Changed
- Handle bert generator updates by @jh619lee in #70
- int8xint8 lm_head dtype for gptj by @BeomGeunCho in #75
Full Changelog: MLPerf4.1-v3.11...MLPerf4.1-v3.12
MLPerf4.1-llama-v3.11
MLPerf4.1-v3.11
What's Changed
- add new models for gptj by @sunghyuckhong in #46
- Trace BERT with the furiosa-llm-models helper and change model_source (huggingface_rngd_gelu, mlperf_submission) by @BeomGeunCho in #47
- Custom dataset for paged attention by @sunghyuckhong in #48
- Add bert ci test by @jh619lee in #49
- GPT-J CI by @sunghyuckhong in #52
- pad with pad token by @jeongin-yun in #55
- Move bert generator init by @jh619lee in #56
- add compact causal mask model. by @BeomGeunCho in #58
- Fix gptj ci by @sunghyuckhong in #57
- Add ci test for causal compact mask bert by @jh619lee in #60
- apply changed get_quant_model to main.py by @BeomGeunCho in #63
- merge split accuracy log files by @BeomGeunCho in #64
- Add model scripts by @BeomGeunCho in #65
New Contributors
- @jh619lee made their first contribution in #49
- @jeongin-yun made their first contribution in #55
Full Changelog: MLPerf4.1-v3.8...MLPerf4.1-v3.11
MLPerf4.1-v3.8
What's Changed
- remove model name for gptj in inference by @sunghyuckhong in #42
- remove calib argument by @sunghyuckhong in #43
- Calibration with padded inputs by @BeomGeunCho in #44
- add erf gelu models by @BeomGeunCho in #45
Full Changelog: MLPerf4.1-v3.5...MLPerf4.1-v3.8
MLPerf4.1-v3.5
What's Changed
- Apply MCP changes: pass quantized_prefill_graph as an input when running create_quant_sim for the decode_graph
MLPerf4.1-v3.4
- Ported QuantPagedAttentionGenerator for paged_attention_rope
- Generation with paged_attention_rope.GPTJForCausalLM is now possible by setting the 'model_source' argument to 'paged_attention_rope' in language/gpt-j/main.py (see the sketch below)
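A minimal sketch of how the new switch can be exercised, assuming main.py exposes model_source as an ordinary argparse flag; the exact flag spelling and wiring in language/gpt-j/main.py are assumptions, not verified code:

```python
# Hypothetical sketch: a 'model_source' argparse flag selecting the
# paged_attention_rope backend. Flag spelling and choices are assumptions.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--model_source",
    default="huggingface",
    choices=["huggingface", "paged_attention_rope"],
    help="'paged_attention_rope' selects paged_attention_rope.GPTJForCausalLM",
)

# e.g. the usage from the note: main.py --model_source paged_attention_rope
args = parser.parse_args(["--model_source", "paged_attention_rope"])
print(args.model_source)  # -> paged_attention_rope
```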
MLPerf4.1-v3.1
What's Changed
- Generation with pagedattention_concat_rope by @sunghyuckhong in #33
- Generate with preallocated_rope by @sunghyuckhong in #34
- GPT-J preallocated (preallocated_concat_rope.py) added, and paged_attention_concat_rope updates applied
  - paged_attention_concat_rope.py
    - Verified Qlevel-4 fx graph extraction and generation (greedy_search)
    - Defined QuantPagedAttentionGenerator in furiosa-llm-repo
      - An implementation for running furiosa-llm.PagedAttentionGenerator as an fx graph rather than as a torch model (see the sketch after this list)
  - preallocated_concat_rope.py
    - Qlevel-4 fx graph conversion complete
    - QuantPreallocatedGenerator implementation complete
      - Plays the same role as QuantPagedAttentionGenerator
    - Follow-up work: generation with the modified RoPE
  - paged_attention_concat_rope.py
    - Model-compressor-private: fixed QLV4EmbeddingMOD, which caused a graph break under torch.dynamo.export (issue confirmed and applied by Seonghwan)
    - Furiosa-llm-models: fixed a case where, with the dtype cast removed, the f32-to-i32 conversion arrived as Cast rather than FxToFxp (issue raised on Slack)
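To make the fx-graph generation idea above concrete, here is a generic, self-contained sketch in which TinyLM and greedy_decode are illustrative stand-ins (not furiosa-llm, MCP, or QuantPagedAttentionGenerator code), showing that a traced torch.fx GraphModule can be driven by a generation loop exactly like the original torch model:

```python
import torch
import torch.fx as fx

class TinyLM(torch.nn.Module):
    """Stand-in language model: token ids -> next-token logits."""
    def __init__(self, vocab: int = 100, dim: int = 32):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return self.head(self.emb(ids))  # [batch, seq, vocab]

# Trace the torch model into an fx graph; the resulting GraphModule is
# callable with the same signature as the original module.
graph_module: fx.GraphModule = fx.symbolic_trace(TinyLM())

@torch.no_grad()
def greedy_decode(step_fn, ids: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # step_fn may be the nn.Module or the traced GraphModule.
    for _ in range(steps):
        next_id = step_fn(ids)[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
    return ids

print(greedy_decode(graph_module, torch.tensor([[1, 2, 3]])))
```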
Full Changelog: MLPerf4.1-v2.1...MLPerf4.1-v3.1
MLPerf4.1-v2.1
- Performance
  - BERT [W8A8KV8, calibrated and evaluated on A100]:
    {"exact_match": 83.9546 (100.32%), "f1": 91.05177 (100.19%)}
  - GPT-J original [W8A8KV8 + SmoothQuant, calibrated and evaluated on H100]:
    rouge1: 43.0717 (100.20%), rouge2: 20.1398 (100.08%), rougeL: 30.0108 (100.08%), gen_len: 3984079 (99.18%)
- Updates
- Changed zero-point shapes from per-head to per-tensor, affecting the zero-point-equalizing term of the matmul operation (see the first sketch after this list)
- Added emulation-in and emulation-out to FP32 for the bf16 x bf16 dot product (see the second sketch after this list)
- Ported GPTJ-paged_attention_concat_rope model
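For the zero-point change above, a minimal sketch of the difference between per-head and per-tensor asymmetric quantization parameters; the shapes, the uint8 range, and the qparams helper are illustrative assumptions, not the MCP implementation:

```python
# Illustrative only: per-head vs per-tensor zero-points for an activation
# shaped [heads, seq, head_dim]; asymmetric uint8 quantization is assumed.
import torch

x = torch.randn(16, 128, 64)  # [heads, seq, head_dim]

def qparams(t: torch.Tensor, dims):
    lo = t.amin(dim=dims, keepdim=True)
    hi = t.amax(dim=dims, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / 255.0
    zero_point = (-lo / scale).round().clamp(0, 255)
    return scale, zero_point

# Per-head: one (scale, zero_point) pair per attention head -> shape [16, 1, 1].
s_h, zp_h = qparams(x, dims=(1, 2))
# Per-tensor: a single pair for the whole tensor -> shape [1, 1, 1].
s_t, zp_t = qparams(x, dims=(0, 1, 2))

print(zp_h.shape, zp_t.shape)  # torch.Size([16, 1, 1]) torch.Size([1, 1, 1])
```

With a per-tensor zero-point, the correction that must be equalized out of the quantized matmul collapses to a single scalar instead of one value per head, which is presumably the matmul effect the note refers to.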
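For the dot-product change, a minimal sketch of one plausible reading of "emulation in, emulation out to FP32": both bf16 operands are cast up to float32, so the dot product (and its accumulation) runs at FP32 precision:

```python
# Illustrative only: emulate a bf16 x bf16 dot product through FP32.
import torch

def bf16_dot_fp32(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    assert a.dtype == torch.bfloat16 and b.dtype == torch.bfloat16
    # emulation in: upcast the bf16 operands to FP32
    # emulation out: the FP32 product is returned as-is
    return a.to(torch.float32) @ b.to(torch.float32)

a = torch.randn(8, 16, dtype=torch.bfloat16)
b = torch.randn(16, 4, dtype=torch.bfloat16)
print(bf16_dot_fp32(a, b).dtype)  # torch.float32
```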