Update bamba.md #4

Open · wants to merge 1 commit into base: bamba
bamba.md: 8 changes (3 additions & 5 deletions)
@@ -106,13 +106,11 @@ We introduce **Bamba-9B**, an inference-efficient Hybrid Mamba2 model trained by

## Motivation 🌟

-Transformer models are increasingly used in real-world applications, but they face GPU memory bandwidth bottlenecks during inference, particularly during per-token decoding, and aggravated in longer context length models. Techniques like lower precision, layer pruning, and compression can alleviate the problem, but do not address the root cause, which is the increasing amount of memory required by the KV-cache as generated sequences get longer. [KV-Cache](https://huggingface.co/docs/transformers/en/kv_cache#best-practices-for-generation-with-cache) is the standard optimization method for autoregressive transformer models.
+Transformer models are increasingly used in real-world applications, but they face memory-bandwidth bottlenecks during inference, particularly during per-token decoding in longer context-length models. Techniques like lower precision, layer pruning, and compression can alleviate the problem, but do not address the root cause, which is the increasing amount of memory required by the KV-cache as the context length increases. Emerging architectures such as [Mamba](https://huggingface.co/papers/2312.00752), [Griffin](https://huggingface.co/papers/2402.19427), and [DeltaNet](https://huggingface.co/papers/2406.06484) eliminate this bottleneck by making the KV-cache size constant. The Mamba architecture has gained significant traction in the community in the recent past. For example, [Jamba](https://huggingface.co/papers/2403.19887) and [Samba](https://huggingface.co/papers/2406.07522) interleave Mamba layers with transformer layers and explore the resulting hybrid Mamba models. [Codestral Mamba](https://mistral.ai/news/codestral-mamba/), a pure Mamba2 model, demonstrates state-of-the-art (SOTA) results on coding tasks, while NVIDIA’s hybrid Mamba2 model achieves competitive performance across long-context and traditional LLM benchmarks. Recent innovations, like [Falcon Mamba](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a) and [Falcon 3 Mamba](https://huggingface.co/tiiuae/Falcon3-Mamba-7B-Base) achieve SOTA rankings on [Hugging Face leaderboards](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) at the time of their releases.

-Emerging architectures such as [Mamba](https://huggingface.co/papers/2312.00752), [Griffin](https://huggingface.co/papers/2402.19427), and [DeltaNet](https://huggingface.co/papers/2406.06484) eliminate this bottleneck by making the KV-cache size constant. The Mamba architecture has gained significant traction in the recent past. For example, [Jamba](https://huggingface.co/papers/2403.19887) and [Samba](https://huggingface.co/papers/2406.07522) interleave Mamba layers with transformer layers. Pure Mamba2 models like [Codestral Mamba](https://mistral.ai/news/codestral-mamba/) demonstrate state-of-the-art (SOTA) results on coding tasks, while NVIDIA’s Mamba2 hybrid achieves competitive performance across long-context and traditional LLM benchmarks. Recent innovations, like [Falcon Mamba](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a), achieve SOTA rankings on [Hugging Face leaderboards](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard).
+We introduce Bamba-9B, a hybrid Mamba2 model trained on 2.2T tokens, further validating these emerging architectures. This collaboration between IBM, Princeton, CMU, and UIUC provides full training lineage, model checkpoints, and pretraining code to support reproducibility and experimentation. The training dataset of the released checkpoints does not contain any benchmark-aligned instruction data (except FLAN) to preserve extended pretraining and fine-tuning flexibility. Our aim is to showcase the hybrid Mamba2 architecture’s potential by demonstrating strong performance at lower-mid size model scale (7B-10B) and to provide the community with checkpoints that are fully reproducible and trained with open datasets.

-We introduce Bamba-9B, a Mamba2 hybrid model trained on 2.2T tokens, further validating these emerging architectures. This collaboration between IBM, Princeton, CMU, and UIUC provides full training lineage, model checkpoints, and pretraining code to support reproducibility and experimentation. The released checkpoints exclude benchmark-aligned instruction data (except FLAN) to preserve fine-tuning flexibility. Our aim is to showcase the Mamba2 hybrid architecture’s potential by demonstrating strong performance at lower-mid size model scale (7B-10B) and providing the community with checkpoints that are fully reproducible and trained with open datasets.
-
-To foster community experimentation, we are also releasing a distributed stateless shuffle data loader and enabling Mamba2 hybrid architecture in open-source libraries like `transformers`, `TRL`, `vLLM`, and `llama.cpp`. We hope these efforts advance the adoption of Mamba architectures, alleviate KV-cache bottlenecks, and close the gap with SOTA open-source models.
+To foster community experimentation, we are also releasing a distributed stateless shuffle data loader and enabling hybrid Mamba2 architecture in open-source libraries like `transformers`, `TRL`, `vLLM`, and `llama.cpp`. We hope these efforts advance the adoption of Mamba architectures, alleviate KV-cache bottlenecks, and close the gap with SOTA open-source models.


### Use in transformers 🤗
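As a minimal sketch of what usage through `transformers` could look like, assuming a `transformers` build that includes the Bamba architecture and an `ibm-fms/Bamba-9B` checkpoint id (both assumptions, not taken from this diff):

```python
# Minimal sketch of running Bamba-9B with transformers.
# Assumptions: the installed transformers version supports the Bamba
# architecture, and the checkpoint is published under the (assumed)
# id "ibm-fms/Bamba-9B".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-fms/Bamba-9B"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps a 9B model on one large GPU
    device_map="auto",
)

prompt = "Hybrid Mamba2 models reduce inference memory pressure because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the Mamba2 layers keep a fixed-size state, generation does not accumulate a growing KV-cache for those layers, which is the bottleneck the motivation section describes.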