# JetMoE-8B-chat: Efficient and High-Performance LLM

Welcome to the official repository of JetMoE-8B-chat, a language model that combines cost-efficiency with high performance, making state-of-the-art language modeling accessible to a broader audience, including academia and small-scale industry players.

## Key Highlights

- **Cost-Effective Training**: Trained for less than $0.1 million, JetMoE-8B significantly lowers the barrier to entry for training large language models (LLMs), demonstrating that high-quality LLM training can be far more economical than widely assumed.
- **Academia-Friendly**: By relying exclusively on public datasets and open-sourcing our code, JetMoE-8B is highly accessible for educational and research purposes. It is designed to be fine-tuned even on consumer-grade GPUs, making it feasible for most academic labs (see the fine-tuning sketch at the end of this README).
- **Efficiency at Scale**: With only 2.2B active parameters during inference, JetMoE-8B strikes a strong balance between computational cost and performance, outperforming similarly sized models such as Gemma-2B across various benchmarks.

## MT-Bench Results

JetMoE-8B-chat has been evaluated on MT-Bench. Here is how it compares with other models:

| Model               | Score     |
|---------------------|-----------|
| GPT-4               | 9.014     |
| GPT-3.5-turbo       | 7.995     |
| Claude-v1           | 7.923     |
| **JetMoE-8B-chat**  | **6.681** |
| Llama-2-13b-chat    | 6.650     |
| Vicuna-13b-v1.3     | 6.413     |
| Wizardlm-13b        | 6.353     |
| Llama-2-7b-chat     | 6.269     |

### Usage

Here's a quick example to get you started with JetMoE-8B-chat:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Initialize the model and tokenizer
model_name = "jetmoe/jetmoe-8b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
    trust_remote_code=True,
)

# Move the model to the GPU if one is available
if torch.cuda.is_available():
    model = model.cuda()
    print("Using GPU:", torch.cuda.get_device_name(torch.cuda.current_device()))
else:
    print("GPU is not available, using CPU instead.")

# Build the prompt with the model's chat template
messages = [
    {"role": "system", "content": "You are a friendly chatbot"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
)

# If using a GPU, move the input IDs to the GPU as well
if torch.cuda.is_available():
    input_ids = input_ids.cuda()

# Generate text
output = model.generate(input_ids, max_length=500, num_return_sequences=1, no_repeat_ngram_size=2)

# If the output is on the GPU, move it back to the CPU for decoding
if torch.cuda.is_available():
    output = output.cpu()

# Decode the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```
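
If you prefer the higher-level `pipeline` API, the sketch below runs the same conversation through a text-generation pipeline. This is a minimal sketch, assuming a recent `transformers` release whose text-generation pipeline accepts chat-style message lists; the generation parameters are illustrative, not tuned values.

```python
import torch
from transformers import pipeline

# Minimal sketch: assumes a transformers version whose text-generation
# pipeline accepts chat-style message lists; parameters are illustrative.
pipe = pipeline(
    "text-generation",
    model="jetmoe/jetmoe-8b-chat",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    model_kwargs={"attn_implementation": "eager"},
)

messages = [
    {"role": "system", "content": "You are a friendly chatbot"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]

result = pipe(messages, max_new_tokens=256)
# With chat-style input, "generated_text" holds the full conversation,
# so the last message is the assistant's reply.
print(result[0]["generated_text"][-1]["content"])
```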
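
### Fine-tuning (sketch)

The Key Highlights note that JetMoE-8B is designed to be fine-tuned even on consumer-grade GPUs. The snippet below is a minimal parameter-efficient fine-tuning sketch, not an official recipe: it assumes the `peft` and `datasets` libraries are installed, and the LoRA settings, the `"all-linear"` target selection, and the wikitext dataset are illustrative placeholders.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "jetmoe/jetmoe-8b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:  # ensure the collator can pad batches
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, attn_implementation="eager", trust_remote_code=True
)

# Wrap the base model with low-rank adapters so only a small fraction of the
# parameters is trained; the settings below are illustrative, not tuned.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules="all-linear", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Illustrative dataset: any corpus with a "text" column works the same way.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="jetmoe-8b-lora", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=1,
                           bf16=True, logging_steps=10),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

On a single consumer GPU you would typically also enable gradient checkpointing and/or quantized loading (e.g. via bitsandbytes); both are omitted here for brevity.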