---
license: gemma
library_name: transformers
extra_gated_heading: Access RecurrentGemma on Hugging Face
extra_gated_prompt: To access RecurrentGemma on Hugging Face, you’re required to review
  and agree to Google’s usage license. To do this, please ensure you’re logged in
  to Hugging Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
---

[google/recurrentgemma-9b-it](https://huggingface.co/google/recurrentgemma-9b-it) quantized to 4-bit using bitsandbytes.

Quantization settings:
```
BitsAndBytesConfig {
  "_load_in_4bit": true,
  "_load_in_8bit": false,
  "bnb_4bit_compute_dtype": "float16",
  "bnb_4bit_quant_storage": "uint8",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": false,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}
```
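
For reference, an equivalent 4-bit load can be reproduced from the base checkpoint with `BitsAndBytesConfig`. A minimal sketch, assuming a CUDA GPU and the `bitsandbytes` and `accelerate` packages are installed:

```python
# Sketch: reproducing the 4-bit quantization settings listed above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

tokenizer = AutoTokenizer.from_pretrained("google/recurrentgemma-9b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/recurrentgemma-9b-it",
    quantization_config=bnb_config,
    device_map="auto",
)
```
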
Original card below.

---

# RecurrentGemma Model Card

**Model Page**: [RecurrentGemma](https://ai.google.dev/gemma/docs/recurrentgemma/model_card)

This model card corresponds to the 9B instruction-tuned version of the RecurrentGemma model. You can also visit the model card of the [9B base model](https://huggingface.co/google/recurrentgemma-9b).

**Resources and technical documentation:**

* [Responsible Generative AI Toolkit](https://ai.google.dev/responsible)
* [RecurrentGemma on Kaggle](https://www.kaggle.com/models/google/recurrentgemma)

**Terms of Use:** [Terms](https://www.kaggle.com/models/google/gemma/license/consent)

**Authors:** Google

## Model information

### Usage

55
+ Below we share some code snippets on how to get quickly started with running the model.
56
+
57
+ First, make sure to `pip install transformers`, then copy the snippet from the section that is relevant for your usecase.
58
+
59
+ ### Running the model on a single / multi GPU
60
+
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/recurrentgemma-9b-it")
model = AutoModelForCausalLM.from_pretrained("google/recurrentgemma-9b-it", device_map="auto")

input_text = "Write me a poem about Machine Learning."
# Move the tokenized inputs to the device the model was loaded onto.
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(**input_ids, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```

### Chat Template

The instruction-tuned models use a chat template that must be adhered to for conversational use.
The easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet.

Let's load the model and apply the chat template to a conversation. In this example, we'll start with a single user interaction:

```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

dtype = torch.bfloat16  # dtype used to load the model weights

tokenizer = AutoTokenizer.from_pretrained("google/recurrentgemma-9b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/recurrentgemma-9b-it",
    device_map="auto",
    torch_dtype=dtype,
)
chat = [
    { "role": "user", "content": "Write a hello world program" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
```

At this point, the prompt contains the following text:

```
<bos><start_of_turn>user
Write a hello world program<end_of_turn>
<start_of_turn>model
```

As you can see, each turn is preceded by a `<start_of_turn>` delimiter and then the role of the entity
(either `user`, for content supplied by the user, or `model` for LLM responses). Turns finish with
the `<end_of_turn>` token.

You can follow this format to build the prompt manually, if you need to do it without the tokenizer's
chat template, as in the sketch below.
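
A minimal sketch of building the same prompt by hand; the `build_prompt` helper is hypothetical, and the turn markers are the ones shown above:

```py
# Hypothetical helper: assembles the prompt manually from the turn markers above.
# The leading <bos> is included so the string matches the template output exactly.
def build_prompt(user_message: str) -> str:
    return (
        f"<bos><start_of_turn>user\n{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

prompt = build_prompt("Write a hello world program")
```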

After the prompt is ready, generation can be performed like this:

```py
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
print(tokenizer.decode(outputs[0]))
```
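
To continue the conversation, one option is to append the decoded reply and the next user turn to `chat`, then re-apply the template. A sketch, reusing the objects above and assuming the template follows the usual Gemma convention of accepting the `assistant` role and rendering it as `model`:

```py
# Sketch: decode only the newly generated tokens, append them as the model's
# turn, then rebuild the prompt for the next round.
reply = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
chat += [
    { "role": "assistant", "content": reply },  # rendered as `model` by the template
    { "role": "user", "content": "Now explain what the program does" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
```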

### Model summary

#### Description

RecurrentGemma is a family of open language models built on a [novel recurrent
architecture](https://arxiv.org/abs/2402.19427) developed at Google. Both
pre-trained and instruction-tuned versions are available in English.

Like Gemma, RecurrentGemma models are well-suited for a variety of text
generation tasks, including question answering, summarization, and reasoning.
Because of its novel architecture, RecurrentGemma requires less memory than
Gemma and achieves faster inference when generating long sequences.

#### Inputs and outputs

* **Input:** Text string (e.g., a question, a prompt, or a document to be
  summarized).
* **Output:** Generated English-language text in response to the input (e.g.,
  an answer to the question, a summary of the document).

#### Citation

```none
@article{recurrentgemma_2024,
    title={RecurrentGemma},
    url={},
    DOI={},
    publisher={Kaggle},
    author={Griffin Team, Soham De, Samuel L Smith, Anushan Fernando, Alex Botev, George-Christian Muraru, Ruba Haroun, Leonard Berrada et al.},
    year={2024}
}
```

### Model data

#### Training dataset and data processing

RecurrentGemma uses the same training data and data processing as used by the
Gemma model family. A full description can be found on the [Gemma model
card](https://ai.google.dev/gemma/docs/model_card#model_data).

## Implementation information

### Hardware and frameworks used during training

Like
[Gemma](https://ai.google.dev/gemma/docs/model_card#implementation_information),
RecurrentGemma was trained on
[TPUv5e](https://cloud.google.com/tpu/docs/intro-to-tpu),
using [JAX](https://github.com/google/jax) and [ML
Pathways](https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/).

## Evaluation information

### Benchmark results

#### Evaluation approach

These models were evaluated against a large collection of different datasets and
metrics to cover different aspects of text generation.

#### Evaluation results

Benchmark | Metric | RecurrentGemma 9B
------------------- | ------------- | -----------------
[MMLU] | 5-shot, top-1 | 60.5
[HellaSwag] | 0-shot | 80.4
[PIQA] | 0-shot | 81.3
[SocialIQA] | 0-shot | 52.3
[BoolQ] | 0-shot | 80.3
[WinoGrande] | partial score | 73.6
[CommonsenseQA] | 7-shot | 73.2
[OpenBookQA] | | 51.8
[ARC-e][ARC-c] | | 78.8
[ARC-c] | | 52.0
[TriviaQA] | 5-shot | 70.5
[Natural Questions] | 5-shot | 21.7
[HumanEval] | pass@1 | 31.1
[MBPP] | 3-shot | 42.0
[GSM8K] | maj@1 | 42.6
[MATH] | 4-shot | 23.8
[AGIEval] | | 39.3
[BIG-Bench] | | 55.2
**Average** | | 56.1

### Inference speed results

RecurrentGemma provides improved sampling speeds, particularly for long sequences or large batch sizes. We compared the sampling speeds of RecurrentGemma-9B to Gemma-7B. Note that Gemma-7B uses Multi-Head Attention, and the speed improvements would be smaller when comparing against a transformer using Multi-Query Attention.

#### Throughput

We evaluated the throughput of RecurrentGemma-9B compared to Gemma-7B, i.e., the maximum number of tokens produced per second as the batch size is increased, using a prefill of 2K tokens.

<img src="max_throughput.png" width="400" alt="Maximum Throughput comparison of RecurrentGemma-9B and Gemma-7B">
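
As a rough illustration of what such a measurement involves, here is a sketch of timing sampling throughput at a single batch size, reusing the `model` and `tokenizer` from the usage section; the prompt, batch size, and generation settings are illustrative, not the harness used for the figure:

```py
# Illustrative sketch: tokens generated per second at one batch size.
import time

batch = ["Write me a poem about Machine Learning."] * 8  # batch size under test
inputs = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=True)
elapsed = time.perf_counter() - start

generated = (out.shape[-1] - inputs["input_ids"].shape[-1]) * out.shape[0]
print(f"{generated / elapsed:.1f} tokens/sec")
```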

#### Latency

We also compared end-to-end speedups achieved by RecurrentGemma-9B over Gemma-7B when sampling a long sequence after a prefill of 4K tokens and using a batch size of 1.

\# Tokens Sampled | Gemma-7B (sec) | RecurrentGemma-9B (sec) | Improvement (%)
----------------- | -------------- | ----------------------- | ---------------
128 | 3.1 | 2.8 | 9.2%
256 | 5.9 | 5.4 | 9.7%
512 | 11.6 | 10.5 | 10.7%
1024 | 23.5 | 20.6 | 14.2%
2048 | 48.2 | 40.9 | 17.7%
4096 | 101.9 | 81.5 | 25.0%
8192 | OOM | 162.8 | -
16384 | OOM | 325.2 | -

## Ethics and safety

### Ethics and safety evaluations

#### Evaluation approach

Our evaluation methods include structured evaluations and internal red-teaming
testing of relevant content policies. Red-teaming was conducted by a number of
different teams, each with different goals and human evaluation metrics. These
models were evaluated against a number of different categories relevant to
ethics and safety, including:

* **Text-to-text content safety:** Human evaluation on prompts covering safety
  policies including child sexual abuse and exploitation, harassment, violence
  and gore, and hate speech.
* **Text-to-text representational harms:** Benchmark against relevant academic
  datasets such as WinoBias and the BBQ dataset.
* **Memorization:** Automated evaluation of memorization of training data,
  including the risk of personally identifiable information exposure.
* **Large-scale harm:** Tests for “dangerous capabilities,” such as chemical,
  biological, radiological, and nuclear (CBRN) risks, as well as tests for
  persuasion and deception, cybersecurity, and autonomous replication.

#### Evaluation results

The results of ethics and safety evaluations are within acceptable thresholds
for meeting [internal
policies](https://storage.googleapis.com/gweb-uniblog-publish-prod/documents/2023_Google_AI_Principles_Progress_Update.pdf#page=11)
for categories such as child safety, content safety, representational harms,
memorization, and large-scale harms. On top of robust internal evaluations, the
results of well-known safety benchmarks like BBQ, Winogender, Winobias,
RealToxicity, and TruthfulQA are shown here.

Benchmark | Metric | RecurrentGemma 9B | RecurrentGemma 9B IT
------------------------ | ------ | ----------------- | --------------------
[RealToxicity] | avg | 10.3 | 8.8
[BOLD] | | 39.8 | 47.9
[CrowS-Pairs] | top-1 | 38.7 | 39.5
[BBQ Ambig][BBQ] | top-1 | 95.9 | 67.1
[BBQ Disambig][BBQ] | top-1 | 78.6 | 78.9
[Winogender] | top-1 | 59.0 | 64.0
[TruthfulQA] | | 38.6 | 47.7
[Winobias 1_2][Winobias] | | 61.5 | 60.6
[Winobias 2_2][Winobias] | | 90.2 | 90.3
[Toxigen] | | 58.8 | 64.5

## Model usage and limitations

### Known limitations

These models have certain limitations that users should be aware of:

* **Training data**
  * The quality and diversity of the training data significantly influence
    the model's capabilities. Biases or gaps in the training data can lead
    to limitations in the model's responses.
  * The scope of the training dataset determines the subject areas the model
    can handle effectively.
* **Context and task complexity**
  * LLMs are better at tasks that can be framed with clear prompts and
    instructions. Open-ended or highly complex tasks might be challenging.
  * A model's performance can be influenced by the amount of context
    provided (longer context generally leads to better outputs, up to a
    certain point).
* **Language ambiguity and nuance**
  * Natural language is inherently complex. LLMs might struggle to grasp
    subtle nuances, sarcasm, or figurative language.
* **Factual accuracy**
  * LLMs generate responses based on information they learned from their
    training datasets, but they are not knowledge bases. They may generate
    incorrect or outdated factual statements.
* **Common sense**
  * LLMs rely on statistical patterns in language. They might lack the
    ability to apply common sense reasoning in certain situations.

### Ethical considerations and risks

The development of large language models (LLMs) raises several ethical concerns.
In creating an open model, we have carefully considered the following:

* **Bias and fairness**
  * LLMs trained on large-scale, real-world text data can reflect
    socio-cultural biases embedded in the training material. These models
    underwent careful scrutiny; the input data pre-processing is described
    and posterior evaluations are reported in this card.
* **Misinformation and misuse**
  * LLMs can be misused to generate text that is false, misleading, or
    harmful.
  * Guidelines for responsible use of the model are provided; see the
    [Responsible Generative AI
    Toolkit](https://ai.google.dev/gemma/responsible).
* **Transparency and accountability**
  * This model card summarizes details on the models' architecture,
    capabilities, limitations, and evaluation processes.
  * A responsibly developed open model offers the opportunity to share
    innovation by making LLM technology accessible to developers and
    researchers across the AI ecosystem.

Risks identified and mitigations:

* **Perpetuation of biases:** Continuous monitoring (using evaluation metrics
  and human review) and the exploration of de-biasing techniques are encouraged
  during model training, fine-tuning, and other use cases.
* **Generation of harmful content:** Mechanisms and guidelines for content
  safety are essential. Developers are encouraged to exercise caution and
  implement appropriate content safety safeguards based on their specific
  product policies and application use cases.
* **Misuse for malicious purposes:** Technical limitations and developer and
  end-user education can help mitigate malicious applications of LLMs.
  Educational resources and reporting mechanisms for users to flag misuse are
  provided. Prohibited uses of Gemma models are outlined in our [terms of
  use](https://www.kaggle.com/models/google/gemma/license/consent).
* **Privacy violations:** Models were trained on data filtered to remove
  personally identifiable information (PII). Developers are encouraged to
  adhere to privacy regulations with privacy-preserving techniques.

## Intended usage

### Application

Open Large Language Models (LLMs) have a wide range of applications across
various industries and domains. The following list of potential uses is not
comprehensive. The purpose of this list is to provide contextual information
about the possible use cases that the model creators considered as part of model
training and development.

* **Content creation and communication**
  * **Text generation:** These models can be used to generate creative text
    formats like poems, scripts, code, marketing copy, email drafts, etc.
  * **Chatbots and conversational AI:** Power conversational interfaces for
    customer service, virtual assistants, or interactive applications.
  * **Text summarization:** Generate concise summaries of a text corpus,
    research papers, or reports.
* **Research and education**
  * **Natural Language Processing (NLP) research:** These models can serve
    as a foundation for researchers to experiment with NLP techniques,
    develop algorithms, and contribute to the advancement of the field.
  * **Language Learning Tools:** Support interactive language learning
    experiences, aiding in grammar correction or providing writing practice.
  * **Knowledge Exploration:** Assist researchers in exploring large bodies
    of text by generating summaries or answering questions about specific
    topics.

### Benefits

At the time of release, this family of models provides high-performance open
large language model implementations designed from the ground up for Responsible
AI development, relative to similarly sized models.

Using the benchmark evaluation metrics described in this document, these models
have been shown to provide superior performance to other, comparably sized open
model alternatives.

In particular, RecurrentGemma models achieve comparable performance to Gemma
models but are faster during inference and require less memory, especially on
long sequences.

[MMLU]: https://arxiv.org/abs/2009.03300
[HellaSwag]: https://arxiv.org/abs/1905.07830
[PIQA]: https://arxiv.org/abs/1911.11641
[SocialIQA]: https://arxiv.org/abs/1904.09728
[BoolQ]: https://arxiv.org/abs/1905.10044
[WinoGrande]: https://arxiv.org/abs/1907.10641
[CommonsenseQA]: https://arxiv.org/abs/1811.00937
[OpenBookQA]: https://arxiv.org/abs/1809.02789
[ARC-c]: https://arxiv.org/abs/1911.01547
[TriviaQA]: https://arxiv.org/abs/1705.03551
[Natural Questions]: https://github.com/google-research-datasets/natural-questions
[HumanEval]: https://arxiv.org/abs/2107.03374
[MBPP]: https://arxiv.org/abs/2108.07732
[GSM8K]: https://arxiv.org/abs/2110.14168
[MATH]: https://arxiv.org/abs/2103.03874
[AGIEval]: https://arxiv.org/abs/2304.06364
[BIG-Bench]: https://arxiv.org/abs/2206.04615
[RealToxicity]: https://arxiv.org/abs/2009.11462
[BOLD]: https://arxiv.org/abs/2101.11718
[CrowS-Pairs]: https://aclanthology.org/2020.emnlp-main.154/
[BBQ]: https://arxiv.org/abs/2110.08193v2
[Winogender]: https://arxiv.org/abs/1804.09301
[TruthfulQA]: https://arxiv.org/abs/2109.07958
[Winobias]: https://arxiv.org/abs/1804.06876
[Toxigen]: https://arxiv.org/abs/2203.09509