[sync] sync npu branch with main #5278

ver217 · 2024-01-18T04:06:40Z

📌 Checklist before creating the PR

I have created an issue for this PR for traceability
The title follows the standard format: [doc/gemini/tensor/...]: A concise description
I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

Summarize your work here.
if you have any plots/diagrams/screenshots/tables, please attach them here.

💥 Checklist before requesting a review

I have linked my PR to an issue (instruction)
My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
I have performed a self-review of my code
I have added thorough tests.
I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

🌝 Yes, I do.
🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

…eline for bert (hpcaitech#5088)

* [shardformer] implement policy for all GPT-J models and test
* [shardformer] support interleaved pipeline parallel for bert finetune
* [shardformer] shardformer support falcon (hpcaitech#4883)
* [shardformer]: fix interleaved pipeline for bert model (hpcaitech#5048)
* [hotfix]: disable seq parallel for gptj and falcon, and polish code (hpcaitech#5093)
* Add Mistral support for Shardformer (hpcaitech#5103)
* [shardformer] add tests to mistral (hpcaitech#5105)
---------
Co-authored-by: Pengtai Xu <[email protected]>
Co-authored-by: ppt0011 <[email protected]>
Co-authored-by: flybird11111 <[email protected]>
Co-authored-by: eric8607242 <[email protected]>

…#4878)

* Add finetuning Colossal-Llama-2 example
* Add finetuning Colossal-Llama-2 example 2
* Add finetuning Colossal-Llama-2 example and support NEFTuning
* Add inference example and refine neftune
* Modify readme file
* update the imports
---------
Co-authored-by: Xu Yuanchen <[email protected]>
Co-authored-by: Camille Zhong <[email protected]>

…el (hpcaitech#5169)

* Support GSM, Data Leakage Evaluation and Tensor Parallel
* remove redundant code and update inference.py in examples/gpt_evaluation
---------
Co-authored-by: Xu Yuanchen <[email protected]>

* fix aaa fix fix fix
* fix
* fix
* test ci
* fix ci fix
* llama support dist-cross fix fix fix fix fix fix fix fix
* fix
* fix
* fix fix
* test ci
* test ci
* fix
* [Colossal-Llama-2] Add finetuning Colossal-Llama-2 example (hpcaitech#4878)
* Add finetuning Colossal-Llama-2 example
* Add finetuning Colossal-Llama-2 example 2
* Add finetuning Colossal-Llama-2 example and support NEFTuning
* Add inference example and refine neftune
* Modify readme file
* update the imports
---------
Co-authored-by: Xu Yuanchen <[email protected]>
Co-authored-by: Camille Zhong <[email protected]>
* llama support dist-cross fix fix fix fix fix fix fix fix
* fix
* fix
* fix fix
* test ci
* test ci
* fix
* fix ci
* fix ci
---------
Co-authored-by: Yuanchen <[email protected]>
Co-authored-by: Xu Yuanchen <[email protected]>
Co-authored-by: Camille Zhong <[email protected]>

…eaved pp (hpcaitech#5134)

* test: add more p2p tests
* fix: remove send_forward_recv_forward as p2p op list need to use the same group
* fix: make send and receive atomic
* feat: update P2PComm fn
* feat: add metadata cache in 1f1b
* feat: add metadata cache in interleaved pp
* feat: modify is_xx_stage fn
* revert: add _broadcast_object_list
* feat: add interleaved pp in llama policy
* feat: set NCCL_BUFFSIZE in HybridParallelPlugin

…#5224)

* update readme
* update readme
* update link
* update
* update readme
* update
* update
* update
* update title
* update example
* update example
* fix content
* add conclusion
* add license
* update
* update
* update version
* fix minor

digger-yu and others added 30 commits November 24, 2023 19:15

fix typo change lazy_iniy to lazy_init (hpcaitech#5099)

2bdf76f

[nfc] fix typo change directoty to directory (hpcaitech#5111)

d5661f0

[FEATURE] Add Safety Eval Datasets to ColossalEval (hpcaitech#5095)

7b789f4

* add safetybench and cvalues(responsibility) eval dataset
* Modify code according to review suggestions
---------
Co-authored-by: Orion-Zheng <[email protected]>

[hotfix] fixed memory usage of shardformer module replacement (hpcaitech#5122)

126cf18

[doc] add moe news (hpcaitech#5128)

177c79f

* [doc] add moe news * [doc] add moe news * [doc] add moe news

[doc] updated paper citation (hpcaitech#5131)

2899cfd

fix typo change JOSNL TO JSONL etc. (hpcaitech#5116)

9110406

[format] applied code formatting on changed files in pull request 5088 (hpcaitech#5127)

d10ee42

Co-authored-by: github-actions <[email protected]>

[format] applied code formatting on changed files in pull request 5124 (hpcaitech#5125)

9b36640

Co-authored-by: github-actions <[email protected]>

[format] applied code formatting on changed files in pull request 5115 (hpcaitech#5118)

f6731db

Co-authored-by: github-actions <[email protected]>

[plugin]fix 3d checkpoint load when booster boost without optimizer. (hpcaitech#5135)

2a2ec49

* fix 3d checkpoint load when booster boost without optimizer fix 3d checkpoint load when booster boost without optimizer
* test ci
* revert ci
* fix fix

[ColossalQA] refactor server and webui & add new feature (hpcaitech#5138)

c7fd9a5

* refactor server and webui & add new feature
* add requirements
* modify readme and ui

[doc] fix colossalqa document (hpcaitech#5146)

368b5e3

* fix doc
* modify doc

fix (hpcaitech#5158)

3dbbf83

fix

[gemini] hotfix NaN loss while using Gemini + tensor_parallel (hpcaitech#5150)

21aa5de

* fix aaa fix fix fix
* fix
* fix
* test ci
* fix ci fix

[colossalqa] fix pangu api (hpcaitech#5170)

b07a6f4

* fix pangu api
* add comment

Fix ColossalEval (hpcaitech#5186)

3ff60d1

Co-authored-by: Xu Yuanchen <[email protected]>

[doc] update pytorch version in documents. (hpcaitech#5177)

681d9b1

* fix aaa fix fix fix
* fix
* fix
* test ci
* fix ci fix
* update pytorch version in documents

polish readme in application/chat (hpcaitech#5194)

af95267

Improve logic for selecting metrics (hpcaitech#5196)

eae01b6

Co-authored-by: Xu <[email protected]>

[doc] Update required third-party library list for testing and torch comptibility checking (hpcaitech#5207)

64519eb

* doc/update requirements-test.txt
* update torch-cuda compatibility check

support linear accumulation fusion (hpcaitech#5199)

02d2328

support linear accumulation fusion support linear accumulation fusion fix

[pipeline]: support arbitrary batch size in forward_only mode (hpcaitech#5201)

3c0d82b

* fix: remove drop last in val & test dataloader
* feat: add run_forward_only, support arbitrary bs
* chore: modify ci script

[pipeline]: add p2p fallback order and fix interleaved pp deadlock (hpcaitech#5214)

d799a30

* fix: add fallback order option and update 1f1b
* fix: fix deadlock comm in interleaved pp
* test: modify p2p test

[devops] update torch versoin in ci (hpcaitech#5217)

7f3400b

flybird11111 and others added 24 commits January 3, 2024 14:26

fix-test (hpcaitech#5210)

365671b

fix-test fix-test

fix flash attn (hpcaitech#5209)

451e914

[nfc] fix typo colossalai/shardformer/ (hpcaitech#5133)

b0b53a1

[doc] Update README.md of Colossal-LLAMA2 (hpcaitech#5233)

915b465

* Update README.md * Update README.md

[doc] Make leaderboard format more uniform and good-looking (hpcaitech#5231)

ce65127

* Make leaderboard format more unifeid and good-looking
* Update README.md
* Update README.md

[doc] add Colossal-LLaMA-2-13B (hpcaitech#5234)

b9b32b1

* [doc] add Colossal-LLaMA-2-13B * [doc] add Colossal-LLaMA-2-13B * [doc] add Colossal-LLaMA-2-13B

[format] applied code formatting on changed files in pull request 5234 (hpcaitech#5235)

4fb4a22

Co-authored-by: github-actions <[email protected]>

[doc] SwiftInfer release (hpcaitech#5236)

7bc6969

* [doc] SwiftInfer release * [doc] SwiftInfer release * [doc] SwiftInfer release * [doc] SwiftInfer release * [doc] SwiftInfer release

[pipeline] A more general _communicate in p2p (hpcaitech#5062)

d565df3

* A more general _communicate
* feat: finish tree_flatten version p2p
* fix: update p2p api calls
---------
Co-authored-by: Wenhao Chen <[email protected]>

[doc] fix typo in Colossal-LLaMA-2/README.md (hpcaitech#5247)

41e52c1

[workflow] fixed build CI (hpcaitech#5240)

edf94a3

* [workflow] fixed build CI
* polish
* polish
* polish
* polish
* polish

[ci] fixed booster test (hpcaitech#5251)

d5eeeb1

* [ci] fixed booster test * [ci] fixed booster test * [ci] fixed booster test

[ci] fixed ddp test (hpcaitech#5254)

2b83418

* [ci] fixed ddp test
* polish

fix typo in applications/ColossalEval/README.md (hpcaitech#5250)

756c400

[ci] fix shardformer tests. (hpcaitech#5255)

e830ef9

* fix ci fix
* revert: revert p2p
* feat: add enable_metadata_cache option
* revert: enable t5 tests
---------
Co-authored-by: Wenhao Chen <[email protected]>

[doc] fix doc typo (hpcaitech#5256)

c174c4f

* [doc] fix annotation display
* [doc] fix llama2 doc

[hotfix]: add pp sanity check and fix mbs arg (hpcaitech#5268)

ef4f0ee

* fix: fix misleading mbs arg
* feat: add pp sanity check
* fix: fix 1f1b sanity check

[workflow] fixed incomplete bash command (hpcaitech#5272)

04244aa

[workflow] fixed oom tests (hpcaitech#5275)

d69cd2e

* [workflow] fixed oom tests
* polish
* polish
* polish

[ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (hpcaitech#5276)

2a0558d

* fix ci fix
* fix test
* revert: revert p2p
* feat: add enable_metadata_cache option
* revert: enable t5 tests
* fix
---------
Co-authored-by: Wenhao Chen <[email protected]>

[shardformer] hybridparallelplugin support gradients accumulation. (hpcaitech#5246)

46e0916

* support gradients acc fix fix fix fix fix fix fix fix fix fix fix fix fix
* fix fix
* fix fix fix

[hotfix] Fix ShardFormer test execution path when using sequence parallelism (hpcaitech#5230)

5d9a0ae

Merge branch 'main' into sync/npu

1484693

ver217 requested a review from a team as a code owner January 18, 2024 04:06

FrankLeeeee merged commit d66e698 into hpcaitech:feature/npu Jan 18, 2024
15 of 22 checks passed

ver217 deleted the sync/npu branch January 18, 2024 06:14
