---
license: other
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- transformers
- gguf
- imatrix
- Llama-3.2-3B
---

Quantizations of https://huggingface.co/meta-llama/Llama-3.2-3B


### Inference Clients/UIs
* [llama.cpp](https://github.com/ggerganov/llama.cpp)
* [KoboldCPP](https://github.com/LostRuins/koboldcpp)
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
* [ollama](https://github.com/ollama/ollama)


---

# From original readme

Last week, the release of DeepSeek-V2 ignited widespread interest in MLA (Multi-head Latent Attention), and many in the community suggested open-sourcing a smaller MoE model for in-depth research. DeepSeek-V2-Lite is that model:

- 16B total parameters, 2.4B active parameters, trained from scratch on 5.7T tokens
- Outperforms 7B dense and 16B MoE models on many English and Chinese benchmarks
- Deployable on a single 40GB GPU, fine-tunable on 8x80GB GPUs

DeepSeek-V2 is a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA makes inference efficient by significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation.

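To make the KV-cache claim concrete, here is a rough back-of-the-envelope sketch comparing the per-token cache size of a standard multi-head attention cache with a cache that stores one latent vector per layer. All dimensions (`num_layers`, `num_heads`, `head_dim`, `latent_dim`) are hypothetical values chosen for illustration, not the actual DeepSeek-V2 configuration, and the helper names are not part of any library. The sketch also ignores any extra per-token state a real MLA implementation may keep.

```python
def full_kv_bytes_per_token(num_layers, num_heads, head_dim, bytes_per_value=2):
    """Per-token cache size when every layer stores full keys and values."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_heads * head_dim * bytes_per_value


def latent_bytes_per_token(num_layers, latent_dim, bytes_per_value=2):
    """Per-token cache size when every layer stores a single latent vector."""
    return num_layers * latent_dim * bytes_per_value


# Hypothetical dimensions, for illustration only.
num_layers, num_heads, head_dim, latent_dim = 24, 16, 128, 512

full_bytes = full_kv_bytes_per_token(num_layers, num_heads, head_dim)
latent_bytes = latent_bytes_per_token(num_layers, latent_dim)
compression_ratio = full_bytes / latent_bytes

print(f"full KV cache: {full_bytes} bytes per token")
print(f"latent cache:  {latent_bytes} bytes per token")
print(f"compression:   {compression_ratio:.1f}x")
```

Under these assumed dimensions the latent cache is 8x smaller per token. The real saving depends on the model's actual configuration.
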
## 7. How to run locally

**To run DeepSeek-V2-Lite inference in BF16 format, a single 40GB GPU is required.**
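
As a rough sanity check on the 40GB figure, the sketch below estimates the memory taken by the weights alone. It assumes the rounded 16B parameter count quoted earlier, 2 bytes per BF16 parameter, and decimal gigabytes (1 GB = 1e9 bytes). The helper name `weight_memory_gb` is illustrative, and the leftover budget for the KV cache and activations is only an estimate.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


total_params = 16e9   # rounded total parameter count (assumption: exactly 16B)
gpu_memory_gb = 40    # single-GPU budget stated above

weights_gb = weight_memory_gb(total_params)   # BF16 uses 2 bytes per parameter
headroom_gb = gpu_memory_gb - weights_gb      # left for KV cache, activations, etc.

print(f"weights:  {weights_gb:.1f} GB")
print(f"headroom: {headroom_gb:.1f} GB")
```
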
### Inference with Hugging Face Transformers
You can use [Hugging Face Transformers](https://github.com/huggingface/transformers) directly for model inference.

#### Text Completion
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2-Lite"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Load the weights in BF16 and move the model to the GPU.
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
model.generation_config = GenerationConfig.from_pretrained(model_name)
# Reuse the EOS token as the padding token.
model.generation_config.pad_token_id = model.generation_config.eos_token_id

text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)

# Decode the full sequence (prompt plus completion).
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

#### Chat Completion
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Load the weights in BF16 and move the model to the GPU.
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
model.generation_config = GenerationConfig.from_pretrained(model_name)
# Reuse the EOS token as the padding token.
model.generation_config.pad_token_id = model.generation_config.eos_token_id

messages = [
    {"role": "user", "content": "Write a piece of quicksort code in C++"}
]
# Apply the chat template and append the assistant prompt so the model starts its reply.
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)

# Decode only the newly generated tokens, skipping the prompt.
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
```

The complete chat template can be found in `tokenizer_config.json` in the Hugging Face model repository.

An example of the chat template is shown below:

```text
<|begin▁of▁sentence|>User: {user_message_1}

A: {assistant_message_1}<|end▁of▁sentence|>User: {user_message_2}

A:
```

You can also add an optional system message:

```text
<|begin▁of▁sentence|>{system_message}

User: {user_message_1}

A: {assistant_message_1}<|end▁of▁sentence|>User: {user_message_2}

A:
```

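The two layouts above can be reproduced with a few lines of plain Python. This is only a sketch for understanding the format: `render_chat` is a hypothetical helper, not part of Transformers, and the template in `tokenizer_config.json` (applied via `tokenizer.apply_chat_template`) remains the authoritative source.

```python
BOS = "<|begin▁of▁sentence|>"
EOS = "<|end▁of▁sentence|>"


def render_chat(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts into the layout shown above."""
    parts = [BOS]
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system":
            # The optional system message directly follows the BOS token.
            parts.append(content + "\n\n")
        elif role == "user":
            parts.append("User: " + content + "\n\n")
        elif role == "assistant":
            # Each assistant turn is closed with the EOS token.
            parts.append("Assistant: " + content + EOS)
        else:
            raise ValueError(f"unsupported role: {role!r}")
    if add_generation_prompt:
        # Leave the prompt open so the model writes the next assistant turn.
        parts.append("Assistant:")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "Bye"},
]
prompt = render_chat(messages)
print(prompt)
```

Comparing this string with the decoded output of `tokenizer.apply_chat_template` is a quick way to check that a hand-built prompt matches what the model expects.
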
### Inference with vLLM (recommended)
To use [vLLM](https://github.com/vllm-project/vllm) for model inference, please merge this Pull Request into your vLLM codebase: https://github.com/vllm-project/vllm/pull/4650.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 8192, 1
model_name = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# enforce_eager=True runs the model in eager mode (no CUDA graph capture).
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

messages_list = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "Translate the following content into Chinese directly: DeepSeek-V2 adopts innovative architectures to guarantee economical training and efficient inference."}],
    [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
]

# Apply the chat template to each conversation; this returns token IDs, not text.
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

# Keep the first completion for each prompt.
generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
```

### LangChain Support
Because our API is OpenAI-compatible, you can easily use it with [LangChain](https://www.langchain.com/).
Here is an example:

```python
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model='deepseek-chat',
    openai_api_key='<your-deepseek-api-key>',
    openai_api_base='https://api.deepseek.com/v1',
    temperature=0.85,
    ma