Bamba
Overview
Bamba-9B is a decoder-only language model based on the Mamba-2 architecture, designed for a wide range of text generation tasks. It is trained from scratch in two stages. In the first stage, the model is trained on 2 trillion tokens from the Dolma v1.7 dataset. In the second stage, it is trained on a further 200 billion tokens drawn from a curated blend of high-quality data to refine its performance.
Check out all Bamba-9B model checkpoints here.
BambaConfig
| Model | Params | # Layers | Hidden Dim. | Attention Heads | GQA | KV Heads | Context Length | Tied Embeddings | 
|---|---|---|---|---|---|---|---|---|
| Bamba | 9B (9.78B) | 32 | 4096 | 32 | Yes | 8 | 4096 | True | 
class transformers.BambaConfig
( vocab_size = 128000 tie_word_embeddings = False hidden_size = 4096 intermediate_size = 14336 num_hidden_layers = 32 num_attention_heads = 32 num_key_value_heads = 8 hidden_act = 'silu' initializer_range = 0.02 rms_norm_eps = 1e-05 use_cache = True num_logits_to_keep = 1 pad_token_id = 0 bos_token_id = 1 eos_token_id = 2 max_position_embeddings = 262144 attention_dropout = 0.0 attn_layer_indices = None mamba_n_heads = 128 mamba_d_head = 'auto' mamba_n_groups = 1 mamba_d_state = 256 mamba_d_conv = 4 mamba_expand = 2 mamba_chunk_size = 256 mamba_conv_bias = True mamba_proj_bias = False **kwargs )
Parameters
-  vocab_size (int, optional, defaults to 128000) — Vocabulary size of the Bamba model. Defines the number of different tokens that can be represented by the input_ids passed when calling BambaModel.
-  tie_word_embeddings (bool, optional, defaults to False) — Whether the model’s input and output word embeddings should be tied. Note that this is only relevant if the model has an output word embedding layer.
-  hidden_size (int, optional, defaults to 4096) — Dimension of the hidden representations.
-  intermediate_size (int, optional, defaults to 14336) — Dimension of the MLP representations.
-  num_hidden_layers (int, optional, defaults to 32) — Number of hidden layers in the decoder.
-  num_attention_heads (int, optional, defaults to 32) — Number of attention heads for each attention layer in the decoder.
-  num_key_value_heads (int, optional, defaults to 8) — The number of key/value heads used to implement Grouped Query Attention (GQA). If num_key_value_heads=num_attention_heads, the model uses Multi Head Attention (MHA); if num_key_value_heads=1, the model uses Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out this paper. If not specified, defaults to 8.
-  hidden_act (str or function, optional, defaults to "silu") — The non-linear activation function (function or string) in the decoder.
-  initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-  rms_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the rms normalization layers.
-  use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
-  num_logits_to_keep (int or None, optional, defaults to 1) — Number of prompt logits to calculate during generation. If None, all logits will be calculated. If an integer value, only the last num_logits_to_keep logits will be calculated. The default is 1 because only the logits of the last prompt token are needed for generation. For long sequences, the logits for the entire sequence may use a lot of memory, so setting num_logits_to_keep=1 reduces the memory footprint significantly.
-  pad_token_id (int, optional, defaults to 0) — The id of the padding token.
-  bos_token_id (int, optional, defaults to 1) — The id of the “beginning-of-sequence” token.
-  eos_token_id (int, optional, defaults to 2) — The id of the “end-of-sequence” token.
-  max_position_embeddings (int, optional, defaults to 262144) — Maximum cached sequence length for the model.
-  attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
-  attn_layer_indices (list, optional) — Specifies the layer indices that will have full attention. Must contain values at most num_hidden_layers.
-  mamba_n_heads (int, optional, defaults to 128) — The number of mamba heads used in the v2 implementation.
-  mamba_d_head (int, optional, defaults to "auto") — Head embedding dimension size.
-  mamba_n_groups (int, optional, defaults to 1) — The number of the mamba groups used in the v2 implementation.
-  mamba_d_state (int, optional, defaults to 256) — The dimension of the mamba state space latents.
-  mamba_d_conv (int, optional, defaults to 4) — The size of the mamba convolution kernel.
-  mamba_expand (int, optional, defaults to 2) — Expanding factor (relative to hidden_size) used to determine the mamba intermediate size
-  mamba_chunk_size (int, optional, defaults to 256) — The size of the chunks into which the sequence is broken during prefill/training.
-  mamba_conv_bias (bool, optional, defaults to True) — Flag indicating whether or not to use bias in the convolution layer of the mamba mixer block.
-  mamba_proj_bias (bool, optional, defaults to False) — Flag indicating whether or not to use bias in the input and output projections ([“in_proj”, “out_proj”]) of the mamba mixer block.
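The mean-pooling conversion described under num_key_value_heads can be sketched in plain Python. This is a minimal illustration rather than the library's conversion code: the helper name mean_pool_kv_heads is hypothetical, and nested lists stand in for the per-head key or value weights.

```python
from typing import List

Vector = List[float]


def mean_pool_kv_heads(heads: List[Vector], num_key_value_heads: int) -> List[Vector]:
    """Average consecutive groups of per-head vectors into `num_key_value_heads` heads.

    Hypothetical helper for illustration: each "head" is a flat list of floats
    standing in for that head's key (or value) projection weights.
    """
    num_heads = len(heads)
    if num_heads % num_key_value_heads != 0:
        raise ValueError("num_key_value_heads must divide the number of original heads")
    group_size = num_heads // num_key_value_heads
    pooled = []
    for g in range(num_key_value_heads):
        group = heads[g * group_size:(g + 1) * group_size]
        # Element-wise mean over the heads that fall into this group.
        pooled.append([sum(column) / group_size for column in zip(*group)])
    return pooled


# Four original heads pooled into two key/value heads (group size 2).
original = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
pooled = mean_pool_kv_heads(original, num_key_value_heads=2)
print(pooled)  # [[2.0, 3.0], [20.0, 30.0]]
```

With num_key_value_heads equal to the number of original heads the input comes back unchanged (MHA), and with num_key_value_heads=1 all heads collapse into one (MQA).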
This is the configuration class to store the configuration of a BambaModel. It is used to instantiate a BambaModel according to the specified arguments, defining the model architecture. The default values are taken from ibm-fms/Bamba-9.8b-2.2T-hf.
The BambaModel is a hybrid Mamba2 architecture with SwiGLU. The checkpoints are jointly trained by IBM, Princeton, and UIUC.
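To make the hybrid layout concrete, the sketch below maps attn_layer_indices to a per-layer block type. It is an illustration under stated assumptions, not the library's code: the helper layer_types is hypothetical, indices are assumed to be zero-based, a value of None is assumed to mean no attention layers, and the indices used in the example are made up.

```python
from typing import List, Optional


def layer_types(num_hidden_layers: int,
                attn_layer_indices: Optional[List[int]] = None) -> List[str]:
    """Return "attention" or "mamba" for each layer (hypothetical helper)."""
    attn = set(attn_layer_indices or [])
    # Assumption: indices are zero-based and must refer to an existing layer.
    out_of_range = sorted(i for i in attn if not 0 <= i < num_hidden_layers)
    if out_of_range:
        raise ValueError(f"attention layer indices out of range: {out_of_range}")
    return ["attention" if i in attn else "mamba" for i in range(num_hidden_layers)]


# Illustrative 8-layer stack with full attention at layers 2 and 5.
layout = layer_types(8, attn_layer_indices=[2, 5])
print(layout)
# ['mamba', 'mamba', 'attention', 'mamba', 'mamba', 'attention', 'mamba', 'mamba']
```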
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
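Several sizes follow arithmetically from the defaults listed above. The sketch below derives them with a small stand-in dataclass; BambaSizes is a hypothetical container, not BambaConfig itself, and it assumes that mamba_d_head="auto" splits the Mamba intermediate size evenly across mamba_n_heads.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class BambaSizes:
    """Hypothetical container mirroring a few BambaConfig defaults."""
    hidden_size: int = 4096
    num_attention_heads: int = 32
    num_key_value_heads: int = 8
    mamba_n_heads: int = 128
    mamba_d_head: Union[int, str] = "auto"
    mamba_expand: int = 2

    @property
    def mamba_intermediate(self) -> int:
        # The expanding factor is relative to hidden_size.
        return self.mamba_expand * self.hidden_size

    @property
    def resolved_mamba_d_head(self) -> int:
        # Assumption: "auto" divides the intermediate size evenly across heads.
        if self.mamba_d_head == "auto":
            return self.mamba_intermediate // self.mamba_n_heads
        return int(self.mamba_d_head)

    @property
    def attention_head_dim(self) -> int:
        return self.hidden_size // self.num_attention_heads

    @property
    def queries_per_kv_head(self) -> int:
        # Number of query heads sharing each key/value head under GQA.
        return self.num_attention_heads // self.num_key_value_heads


sizes = BambaSizes()
print(sizes.mamba_intermediate)      # 8192
print(sizes.resolved_mamba_d_head)   # 64
print(sizes.attention_head_dim)      # 128
print(sizes.queries_per_kv_head)     # 4
```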
BambaForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("ibm-fms/Bamba-9B")
tokenizer = AutoTokenizer.from_pretrained("ibm-fms/Bamba-9B")
message = ["Mamba is a snake with following properties  "]
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
response = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
forward
( input_ids: LongTensor = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[transformers.models.bamba.modeling_bamba.HybridMambaAttentionDynamicCache] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None num_logits_to_keep: int = 0 **kwargs  ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
-  input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
-  attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
   - 1 for tokens that are not masked,
   - 0 for tokens that are masked.
   Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
   If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values).
   If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.
   - 1 indicates the head is not masked,
   - 0 indicates the head is masked.
-  position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
-  past_key_values (HybridMambaAttentionDynamicCache, optional, returned when use_cache=True is passed or when config.use_cache=True) — A HybridMambaAttentionDynamicCache object containing pre-computed hidden-states (keys and values in the self-attention blocks, and convolution and SSM states in the mamba blocks) that can be used (see past_key_values input) to speed up sequential decoding. Key and value cache tensors have shape (batch_size, num_heads, seq_len, head_dim). Convolution and SSM state tensors have shape (batch_size, d_inner, d_conv) and (batch_size, d_inner, d_state) respectively. See the HybridMambaAttentionDynamicCache class for more details. If past_key_values is used, the user can optionally input only the last input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
-  inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
-  use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
-  output_router_logits (bool, optional) — Whether or not to return the logits of all the routers. They are useful for computing the router loss, and should not be returned during inference.
-  return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
-  cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
-  labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
-  num_logits_to_keep (int or None, optional) — Calculate logits for the last num_logits_to_keep tokens. If None, calculate logits for all input_ids. Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes significant for long sequences.
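The memory effect of num_logits_to_keep is simple arithmetic on the logits tensor of shape (batch_size, kept_positions, vocab_size). The sketch below estimates it; logits_bytes is a hypothetical helper, float32 storage is an assumption, and the None-means-all behavior follows the parameter description above rather than an inspection of the implementation.

```python
from typing import Optional


def logits_bytes(
    batch_size: int,
    sequence_length: int,
    vocab_size: int,
    num_logits_to_keep: Optional[int] = None,
    bytes_per_value: int = 4,  # assumption: float32 logits
) -> int:
    """Estimate the size of the logits tensor in bytes (hypothetical helper)."""
    # Following the parameter description: None keeps every position,
    # an integer keeps only the last `num_logits_to_keep` positions.
    if num_logits_to_keep is None:
        kept = sequence_length
    else:
        kept = min(num_logits_to_keep, sequence_length)
    return batch_size * kept * vocab_size * bytes_per_value


# One 4096-token prompt with the default vocabulary size of 128000.
full = logits_bytes(1, 4096, 128000)
last_only = logits_bytes(1, 4096, 128000, num_logits_to_keep=1)
print(full, last_only, full // last_only)  # 2097152000 512000 4096
```

Under these assumptions the full-sequence logits take roughly 2 GB, while keeping only the last position takes about 0.5 MB.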
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (BambaConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
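How the loss treats labels set to -100 can be shown with a small pure-Python cross-entropy. This is a sketch, not the model's loss code: causal_lm_loss is a hypothetical helper, nested lists stand in for tensors, and it assumes the usual causal-LM convention in which the logits at position t are scored against the label at position t + 1.

```python
import math
from typing import List

IGNORE_INDEX = -100


def causal_lm_loss(logits: List[List[float]], labels: List[int]) -> float:
    """Mean next-token cross-entropy over positions whose label is not -100."""
    total, count = 0.0, 0
    # Position t predicts token t + 1: drop the last logits row and the first label.
    for scores, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # ignored positions contribute nothing to the loss
        top = max(scores)
        # Numerically stable log-sum-exp over the vocabulary.
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[target]
        count += 1
    return total / count if count else float("nan")


# Four positions, vocabulary of 4, uniform scores everywhere.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(4)]
labels = [0, 1, IGNORE_INDEX, 2]
loss = causal_lm_loss(uniform_logits, labels)
print(round(loss, 4))  # 1.3863, i.e. log(4), averaged over the two unmasked targets
```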
The BambaForCausalLM forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoTokenizer, BambaForCausalLM
>>> model = BambaForCausalLM.from_pretrained("...")
>>> tokenizer = AutoTokenizer.from_pretrained("...")
>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
This HF implementation is contributed by ani300 and fabianlim.