	---
	license: other
	language:
	- en
	pipeline_tag: text-generation
	inference: false
	tags:
	- transformers
	- gguf
	- imatrix
	- Llama-3.2-3B
	---
	Quantizations of https://huggingface.co/meta-llama/Llama-3.2-3B


	### Inference Clients/UIs
	* [llama.cpp](https://github.com/ggerganov/llama.cpp)
	* [KoboldCPP](https://github.com/LostRuins/koboldcpp)
	* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
	* [ollama](https://github.com/ollama/ollama)


	---

	# From original readme

	Last week, the release and buzz around DeepSeek-V2 have ignited widespread interest in MLA (Multi-head Latent Attention)! Many in the community suggested open-sourcing a smaller MoE model for in-depth research. And now DeepSeek-V2-Lite comes out:

	- 16B total params, 2.4B active params, trained from scratch on 5.7T tokens
	- Outperforms 7B dense and 16B MoE models on many English & Chinese benchmarks
	- Deployable on a single 40G GPU, fine-tunable on 8x80G GPUs

	DeepSeek-V2 is a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA makes inference efficient by compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation.
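
To make the KV-cache compression idea concrete, here is a rough back-of-the-envelope sketch comparing how many values per token a standard multi-head attention cache stores against a single shared latent vector. The dimensions (`N_HEADS`, `HEAD_DIM`, `LATENT_DIM`) are hypothetical and chosen only for illustration; they are not taken from the DeepSeek-V2 configuration, and the sketch ignores any additional per-token positional components a real implementation may cache.

```python
def kv_cache_elements_per_token(n_heads: int, head_dim: int) -> int:
    """Standard multi-head attention caches one key and one value vector per head."""
    return 2 * n_heads * head_dim


def latent_cache_elements_per_token(latent_dim: int) -> int:
    """A latent-compressed cache stores one shared vector per token instead."""
    return latent_dim


# Hypothetical dimensions for illustration only (not the real model config).
N_HEADS, HEAD_DIM, LATENT_DIM = 16, 128, 512

full = kv_cache_elements_per_token(N_HEADS, HEAD_DIM)
latent = latent_cache_elements_per_token(LATENT_DIM)

print(f"full KV cache per token per layer: {full} elements")
print(f"latent cache per token per layer:  {latent} elements")
print(f"reduction factor: {full / latent:.1f}x")
```

Under these assumed sizes the latent cache is 8x smaller per token and layer; the actual saving depends on the real head count, head dimension, and latent width.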

	## 7. How to run locally

	**To run DeepSeek-V2-Lite in BF16 format for inference, one 40GB GPU is required.**
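
The 40GB figure can be sanity-checked with simple arithmetic. The sketch below assumes all 16B parameters are resident in BF16 (2 bytes each) and treats 40GB as 40e9 bytes; the function name `weight_memory_gb` is introduced here for illustration. Whatever remains after the weights has to cover activations, the KV cache, and framework overhead, so this is an estimate rather than a measured footprint.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


TOTAL_PARAMS = 16e9   # "16B total params" from the model description
GPU_MEMORY_GB = 40    # assumption: 40GB treated as 40e9 bytes

weights_gb = weight_memory_gb(TOTAL_PARAMS)   # BF16 uses 2 bytes per parameter
headroom_gb = GPU_MEMORY_GB - weights_gb

print(f"weights in BF16: {weights_gb:.0f} GB")
print(f"left for activations and KV cache: {headroom_gb:.0f} GB")
```
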
	### Inference with Huggingface's Transformers
	You can use [Hugging Face Transformers](https://github.com/huggingface/transformers) directly for model inference.

	#### Text Completion
	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

	model_name = "deepseek-ai/DeepSeek-V2-Lite"
	tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
	model.generation_config = GenerationConfig.from_pretrained(model_name)
	model.generation_config.pad_token_id = model.generation_config.eos_token_id

	text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
	inputs = tokenizer(text, return_tensors="pt")
	outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)

	result = tokenizer.decode(outputs[0], skip_special_tokens=True)
	print(result)
	```

	#### Chat Completion
	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

	model_name = "deepseek-ai/DeepSeek-V2-Lite-Chat"
	tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
	model.generation_config = GenerationConfig.from_pretrained(model_name)
	model.generation_config.pad_token_id = model.generation_config.eos_token_id

	messages = [
	    {"role": "user", "content": "Write a piece of quicksort code in C++"}
	]
	input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
	outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)

	result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
	print(result)
	```

	The complete chat template can be found in `tokenizer_config.json` in the Hugging Face model repository.

	An example of the chat template is shown below:

	```text
	<｜begin▁of▁sentence｜>User: {user_message_1}

	Assistant: {assistant_message_1}<｜end▁of▁sentence｜>User: {user_message_2}

	Assistant:
	```

	You can also add an optional system message:

	```text
	<｜begin▁of▁sentence｜>{system_message}

	User: {user_message_1}

	Assistant: {assistant_message_1}<｜end▁of▁sentence｜>User: {user_message_2}

	Assistant:
	```
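
For debugging prompts without loading the tokenizer, the layout shown above can be reproduced by hand. The following is a minimal sketch, assuming the template consists only of the pieces visible in the two examples; `render_chat` is a hypothetical helper, not part of any library, and the authoritative template remains the one in `tokenizer_config.json` as applied by `tokenizer.apply_chat_template`.

```python
BOS = "<｜begin▁of▁sentence｜>"
EOS = "<｜end▁of▁sentence｜>"


def render_chat(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts into the layout shown above."""
    parts = [BOS]
    for msg in messages:
        role, content = msg["role"], msg["content"]
        if role == "system":
            # The optional system message follows BOS directly, then a blank line.
            parts.append(f"{content}\n\n")
        elif role == "user":
            parts.append(f"User: {content}\n\n")
        elif role == "assistant":
            # Each completed assistant turn is closed with the end-of-sentence token.
            parts.append(f"Assistant: {content}{EOS}")
        else:
            raise ValueError(f"unsupported role: {role!r}")
    if add_generation_prompt:
        # Leave the prompt open so the model continues as the assistant.
        parts.append("Assistant:")
    return "".join(parts)


prompt = render_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "Write a piece of quicksort code in C++"},
])
print(prompt)
```

Comparing this string with the decoded output of `apply_chat_template` is a quick way to check whether a hand-built prompt matches what the model was trained on.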

	### Inference with vLLM (recommended)
	To utilize [vLLM](https://github.com/vllm-project/vllm) for model inference, please merge this Pull Request into your vLLM codebase: https://github.com/vllm-project/vllm/pull/4650.

	```python
	from transformers import AutoTokenizer
	from vllm import LLM, SamplingParams

	max_model_len, tp_size = 8192, 1
	model_name = "deepseek-ai/DeepSeek-V2-Lite-Chat"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
	sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

	messages_list = [
	    [{"role": "user", "content": "Who are you?"}],
	    [{"role": "user", "content": "Translate the following content into Chinese directly: DeepSeek-V2 adopts innovative architectures to guarantee economical training and efficient inference."}],
	    [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
	]

	prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

	outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

	generated_text = [output.outputs[0].text for output in outputs]
	print(generated_text)
	```

	### LangChain Support
	Since our API is compatible with OpenAI, you can easily use it in [langchain](https://www.langchain.com/).
	Here is an example:

	```
	from langchain_openai import ChatOpenAI
	llm = ChatOpenAI(
	model='deepseek-chat',
	openai_api_key=<your-deepseek-api-key>,
	openai_api_base='https://api.deepseek.com/v1',
	temperature=0.85,
	ma