---
license: apache-2.0
tags:
- snowflake
- arctic
- moe
---
## Model Details
Arctic is a dense-MoE Hybrid transformer architecture pre-trained from scratch by the Snowflake AI
Research Team. We are releasing model checkpoints for both the base and instruct-tuned versions of
Arctic under an Apache-2.0 license. This means you can use them freely in your own research,
prototypes, and products. Please see our blog
[Snowflake Arctic: The Best LLM for Enterprise AI — Efficiently Intelligent, Truly Open](https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/)
for more information on Arctic and links to other relevant resources, such as our series of cookbooks
covering how to train your own custom MoE models, how to produce high-quality training data,
and much more.
* [Arctic-Base](https://huggingface.co/Snowflake/snowflake-arctic-base/)
* [Arctic-Instruct](https://huggingface.co/Snowflake/snowflake-arctic-instruct/)
For the latest details about Snowflake Arctic, including tutorials, please refer to our GitHub repo:
* https://github.com/Snowflake-Labs/snowflake-arctic
**Model developers** Snowflake AI Research Team
**License** Apache-2.0
**Input** Models input text only.
**Output** Models generate text and code only.
**Model Release Date** April 24, 2024.
## Model Architecture
Arctic combines a 10B dense transformer model with a residual 128x3.66B MoE MLP, resulting in 480B
total parameters and 17B active parameters chosen using top-2 gating. For more details about Arctic's model
architecture, training process, data, etc., [see our series of cookbooks](https://www.snowflake.com/en/data-cloud/arctic/cookbook/).
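
The parameter counts above can be checked with a little arithmetic. The following is a minimal sketch, not code from the Arctic repository: the constant and function names are illustrative, and it assumes the quoted figures (10B dense, 128 experts of 3.66B each, top-2 gating) are rounded values, so the results only approximately match the headline 480B and 17B.

```python
# Rounded figures quoted in the model card, in billions of parameters.
DENSE_PARAMS_B = 10.0    # dense transformer
NUM_EXPERTS = 128        # experts in the residual MoE MLP
EXPERT_PARAMS_B = 3.66   # parameters per expert
TOP_K = 2                # experts selected per token by top-2 gating


def total_params_b() -> float:
    """All parameters stored in the checkpoint: dense part plus every expert."""
    return DENSE_PARAMS_B + NUM_EXPERTS * EXPERT_PARAMS_B


def active_params_b() -> float:
    """Parameters used for one token: dense part plus the TOP_K chosen experts."""
    return DENSE_PARAMS_B + TOP_K * EXPERT_PARAMS_B


if __name__ == "__main__":
    print(f"total  ~ {total_params_b():.1f}B")   # close to the quoted 480B
    print(f"active ~ {active_params_b():.1f}B")  # close to the quoted 17B
```

Only the dense part and the two selected experts run for each token, which is why the active count is a small fraction of the total.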
## Usage
As of April 24, 2024, we are actively working with the maintainers of `transformers` to include the Arctic
model implementation. Until this support is released, please follow these instructions to get the
required dependencies for using Arctic:
```bash
pip install git+https://github.com/Snowflake-Labs/transformers.git@arctic
```
Arctic leverages several features from [DeepSpeed](https://github.com/microsoft/DeepSpeed). You will need to
install a recent version of DeepSpeed (0.14.2 or later) to get all of these required features:
```bash
pip install "deepspeed>=0.14.2"
```
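
Before downloading the checkpoint, it can be useful to confirm that the installed DeepSpeed meets the minimum version. The sketch below is an illustration rather than part of the official instructions: the helper names are hypothetical, and the comparison only looks at the leading numeric part of the version string, so a pre-release such as `0.14.2.dev0` is treated the same as `0.14.2`.

```python
import re
from importlib import metadata

MIN_DEEPSPEED = "0.14.2"  # minimum version taken from the pip command in this card


def version_tuple(version: str) -> tuple:
    """Leading numeric components of a version string, e.g. '0.14.2+cu121' -> (0, 14, 2)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, required: str = MIN_DEEPSPEED) -> bool:
    """True if the installed version is at least the required one."""
    return version_tuple(installed) >= version_tuple(required)


def check_deepspeed() -> str:
    """Describe whether the DeepSpeed install in this environment is new enough."""
    try:
        installed = metadata.version("deepspeed")
    except metadata.PackageNotFoundError:
        return "deepspeed is not installed"
    if meets_minimum(installed):
        return f"deepspeed {installed} satisfies >= {MIN_DEEPSPEED}"
    return f"deepspeed {installed} is older than {MIN_DEEPSPEED}; please upgrade"


if __name__ == "__main__":
    print(check_deepspeed())
```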
### Inference examples
Due to the model size, we recommend using a single 8xH100 instance from your
favorite cloud provider, such as AWS [p5.48xlarge](https://aws.amazon.com/ec2/instance-types/p5/) or
Azure [ND96isr_H100_v5](https://learn.microsoft.com/en-us/azure/virtual-machines/nd-h100-v5-series).
This example uses FP8 quantization provided by DeepSpeed in the backend. You can also use FP6
quantization by specifying `q_bits=6` in the `ArcticQuantizationConfig`. The `"150GiB"` setting
for `max_memory` is required until DeepSpeed's FP quantization is supported natively as an [HFQuantizer](https://huggingface.co/docs/transformers/main/en/hf_quantizer#build-a-new-hfquantizer-class), which we
are actively working on.
```python
import os

# Enable hf_transfer for faster checkpoint download (set before importing transformers)
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.models.arctic.configuration_arctic import ArcticQuantizationConfig

tokenizer = AutoTokenizer.from_pretrained("Snowflake/snowflake-arctic-instruct")

# FP8 quantization via DeepSpeed; use q_bits=6 for FP6
quant_config = ArcticQuantizationConfig(q_bits=8)

model = AutoModelForCausalLM.from_pretrained(
    "Snowflake/snowflake-arctic-instruct",
    low_cpu_mem_usage=True,
    device_map="auto",
    ds_quantization_config=quant_config,
    max_memory={i: "150GiB" for i in range(8)},  # one entry per GPU
    torch_dtype=torch.bfloat16)

# Build the prompt with the model's chat template and generate a short reply
messages = [{"role": "user", "content": "What is 1 + 1 "}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")
outputs = model.generate(input_ids=input_ids, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```
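
A rough back-of-the-envelope estimate helps explain why quantization is used on a single 8-GPU node. This is a sketch under stated assumptions, not a measurement: it assumes 80 GB of memory per H100, treats all 480B parameters as stored at the given bit width, and ignores activations, the KV cache, and quantization metadata. The function names are illustrative.

```python
TOTAL_PARAMS = 480e9   # total parameters, from the architecture section
NUM_GPUS = 8           # single 8xH100 instance
GPU_MEMORY_GB = 80     # assumption: 80 GB per H100


def weight_gb(bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes) at a given bit width."""
    return TOTAL_PARAMS * bits_per_param / 8 / 1e9


def fits_on_node(bits_per_param: int) -> bool:
    """True if evenly sharded weights alone fit within each GPU's memory."""
    return weight_gb(bits_per_param) / NUM_GPUS < GPU_MEMORY_GB


if __name__ == "__main__":
    for bits in (16, 8, 6):  # bfloat16, FP8 (q_bits=8), FP6 (q_bits=6)
        total = weight_gb(bits)
        print(f"{bits:>2}-bit: {total:.0f} GB total, "
              f"{total / NUM_GPUS:.0f} GB per GPU, fits={fits_on_node(bits)}")
```

Under these assumptions, unquantized bfloat16 weights would need about 120 GB per GPU, while FP8 and FP6 need about 60 GB and 45 GB respectively, which leaves headroom for activations during inference.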
The Arctic GitHub page has additional code snippets and examples for running inference:
* Example with pure-HF: https://github.com/Snowflake-Labs/snowflake-arctic/blob/main/inference
* Tutorial using vLLM: https://github.com/Snowflake-Labs/snowflake-arctic/tree/main/inference/vllm