---
license: apache-2.0
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/643197ac288c9775673a01e9/w-lgOpASM1DMl2PO0kdFy.png)

## Introduction

APUS-xDAN-4.0-MOE is a transformer-based, decoder-only language model trained on a large corpus of data for robust performance.

This is an enhanced MoE (Mixture of Experts) model built on top of a continually pre-trained LLaMA architecture, further optimized with human-feedback algorithms to improve reasoning, mathematical, and logical capabilities at inference time.

For more comprehensive information, please visit our blog post and GitHub repository:
https://github.com/shootime2021/APUS-xDAN-4.0-moe

## Model Details
APUS-xDAN-4.0-MOE uses a Mixture of Experts (MoE) architecture that incorporates components from dense language models; specifically, it inherits its capabilities from the highly performant xDAN-L2 series. With 136 billion total parameters, of which roughly 30 billion are activated per forward pass, APUS-xDAN-4.0-MOE is highly efficient at runtime.

Through quantization, the open-source version occupies only 42 GB, making it compatible with consumer-grade GPUs such as the RTX 4090 and RTX 3090.

Key specifications:

- **Parameters:** 136B
- **Architecture:** Mixture of 4 Experts (MoE)
- **Experts Utilization:** 2 experts used per token
- **Layers:** 60
- **Attention Heads:** 56 for queries, 8 for keys/values
- **Embedding Size:** 7,168
- **Additional Features:**
  - Rotary embeddings (RoPE)
  - Supports activation sharding and 1.5-bit to 4-bit quantization
- **Maximum Sequence Length (context):** 32,768 tokens 
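
The top-2-of-4 routing described above can be illustrated with a short sketch. This is a minimal, self-contained example assuming a standard softmax gate and a simple SiLU feed-forward expert; the layer names, expansion factor, and gating details are illustrative assumptions, not the model's actual implementation.

```python
# Minimal sketch of top-2 routing over 4 experts (illustrative only;
# not the model's actual implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, hidden_size=7168, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router that scores each token against every expert
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        # Expert FFNs (4x expansion is an assumption, not a confirmed detail)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.SiLU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, hidden_size)
        scores = F.softmax(self.gate(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)      # 2 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out
```

Only the selected 2 of 4 experts run for each token, which is why roughly 30B of the 136B parameters are active at inference time.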
## Usage

| Model | Quantization | Size | Context | Hardware Requirement |
|-------|--------------|------|---------|----------------------|
| APUS-xDAN4.0-MoE-0402.Q2_K.gguf | Q2_K | 39G | 32k | 2x 24G GPU memory |
| APUS-xDAN4.0-MoE-0402.IQ3_XXS.gguf | IQ3_XXS | 41G | 32k | 2x 24G GPU memory |
| APUS-xDAN4.0-MoE-0402.Q3_K_M_Matrix.gguf | Q3_K_M | 51G | 32k | 2x 24G GPU memory |
| APUS-xDAN4.0-MoE-0402.Q4_K_M.gguf | Q4_K_M | 64G | 32k | 3x 24G GPU memory |
| APUS-xDAN4.0-MoE-0402 | | | | |
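
If the quantized GGUF files listed above are hosted on the Hugging Face Hub, a file can be fetched programmatically before running llama.cpp. The `repo_id` in this sketch is a placeholder, not a confirmed repository path; substitute the actual model repo.

```python
# Download one of the GGUF files from the Hugging Face Hub.
# NOTE: repo_id is a placeholder; replace it with the actual model repository.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="APUS-xDAN/APUS-xDAN-4.0-MOE",        # placeholder repo id
    filename="APUS-xDAN4.0-MoE-0402.Q2_K.gguf",   # ~39G, per the table above
)
print(model_path)
```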



### Initial Setup
```bash
# Clone and build llama.cpp with CUDA support
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make LLAMA_CUDA=1
```
### Interactive Chat
```bash
./main -m APUS-xDAN4.0-MoE-0402.Q2_K.gguf \
  --prompt "You are a helpful assistant named APUS-xDAN4.0 MoE." --chatml \
  --interactive \
  --temp 0.7 \
  --ctx-size 4096   # can be raised up to 32768 (32k)
```
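
As an alternative to the `./main` CLI, the same GGUF file can be loaded through the `llama-cpp-python` bindings. The snippet below is a sketch mirroring the flags above; `n_gpu_layers`, the chat format, and the example prompt are assumptions rather than an officially documented path for this model.

```python
# Sketch: loading the same GGUF through llama-cpp-python instead of ./main.
from llama_cpp import Llama

llm = Llama(
    model_path="APUS-xDAN4.0-MoE-0402.Q2_K.gguf",
    n_ctx=4096,          # the model supports up to 32768
    n_gpu_layers=-1,     # offload all layers; requires sufficient GPU memory
    chat_format="chatml",
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant named APUS-xDAN4.0 MoE."},
        {"role": "user", "content": "Explain Mixture of Experts in one sentence."},
    ],
    temperature=0.7,
)
print(out["choices"][0]["message"]["content"])
```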
## License

APUS-xDAN-4.0-MOE is distributed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.