---
license: cc-by-nc-4.0
base_model: Qwen/Qwen2-7B-Instruct
model-index:
- name: Dolphin 
  results: []
tags:
- RAG
- on-device language model
- Retrieval Augmented Generation
inference: false
space: false
spaces: false
language:
- en
---
# Dolphin: Long Context as a New Modality for on-device RAG

<p align="center">
- <a href="https://www.nexaai.com/models" target="_blank">Nexa Model Hub</a>
- <a href="https://arxiv.org/abs/2404.01744" target="_blank">ArXiv</a>
</p>

<p align="center" width="100%">
  <a><img src="logo.png" alt="nexa-octopus" style="width: 30%; min-width: 300px; display: block; margin: auto;"></a>
</p>

## Overview
Dolphin accelerates language model inference by treating long context as a new modality, analogous to the image, audio, and video modalities in vision-language models. A language encoder compresses the context into a small set of embeddings, applying multimodal-model concepts to make long-context inference more efficient. Model highlights:
- 🧠 Context as a distinct modality
- 🗜️ Language encoder for context compression
- 🔗 Multimodal techniques applied to language processing
- ⚡ Optimized for energy efficiency and on-device use
- 📜 Specialized for long context understanding

## Model Architecture
Dolphin employs a decoder-decoder framework with two main components:
1. A smaller decoder (0.5B parameters) that encodes and compresses information from long contexts
2. A larger decoder (7B parameters) that comprehends the current query and generates the response

A projector aligns the embeddings produced by the context encoder with the embedding space of the main decoder.

![Model Architecture](modelstructure.jpg)
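To make the data flow concrete, here is a minimal PyTorch sketch of how a projector can bridge the two decoders. It is an illustration only, not the released implementation: the class names, the MLP shape, and the hidden sizes (896 for a 0.5B-scale Qwen2 encoder, 3584 for a 7B Qwen2 decoder) are assumptions chosen for the example.

```python
import torch
import torch.nn as nn


class ContextProjector(nn.Module):
    """Illustrative projector (assumed design): maps the context encoder's hidden
    states for the memory tokens into the main decoder's embedding space."""

    def __init__(self, encoder_dim: int = 896, decoder_dim: int = 3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, decoder_dim),
            nn.GELU(),
            nn.Linear(decoder_dim, decoder_dim),
        )

    def forward(self, memory_states: torch.Tensor) -> torch.Tensor:
        # (batch, MEMORY_SIZE, encoder_dim) -> (batch, MEMORY_SIZE, decoder_dim)
        return self.proj(memory_states)


def splice_memory(prompt_embeds: torch.Tensor,
                  memory_embeds: torch.Tensor,
                  placeholder_mask: torch.Tensor) -> torch.Tensor:
    """Replace the embeddings at the memory placeholder positions (the -1 ids in
    the inference example below) with the projected context embeddings."""
    out = prompt_embeds.clone()
    out[placeholder_mask] = memory_embeds.reshape(-1, memory_embeds.size(-1))
    return out
```

In this setup the main 7B decoder attends over a short query plus a fixed number of projected memory embeddings instead of the full long context, which is the intended source of the efficiency gain.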

## Running the Model
Method 1: download this repository and run the following commands:
```bash
git lfs install
git clone https://huggingface.co/NexaAIDev/Dolphin
python inference_example.py
```

Method 2: install the `nexaai-dolphin` package
```bash
pip install nexaai-dolphin
```
Then run the following Python script:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
import torch
from dolphin.configuration_dolphin import DolphinConfig
from dolphin.modeling_dolphin import DolphinForCausalLM


def inference_instruct(mycontext, question, device="cuda:0"):
    import time
    MEMORY_SIZE = 32  # number of memory placeholder tokens the context is compressed into
    start_time = time.time()
    generated_token_ids = []
    prompt = f" <context>{question}"
    text_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split("<context>")]
    input_ids = (
        torch.tensor(
            text_chunks[0] + [-1] * MEMORY_SIZE + text_chunks[1], dtype=torch.long
        )
        .unsqueeze(0)
        .to(device)
    )
    # Tokenize the context followed by the memory placeholder tokens
    context_tokenized = tokenizer(
        mycontext + "".join([f"[memory_{i}]" for i in range(MEMORY_SIZE)]),
        return_tensors="pt",
    )
    context_tokenized = {k: v.to(device) for k, v in context_tokenized.items()}
    context_token_count = (context_tokenized["input_ids"]).shape[1] - MEMORY_SIZE
    # Autoregressive decoding loop (generation length capped by the context token count)
    for i in range(context_token_count):
        next_token = (
            model(
                input_ids,
                context_input_ids=context_tokenized["input_ids"],
                context_attention_mask=context_tokenized["attention_mask"],
            )
            .logits[:, -1]
            .argmax(-1)
        )
        if next_token.item() == 151643:  # Qwen2 <|endoftext|> token id: stop generating
            break
        generated_token_ids.append(next_token.item())
        input_ids = torch.cat([input_ids, next_token.unsqueeze(1)], dim=-1)
    result = tokenizer.decode(generated_token_ids)
    print(f"Time taken: {time.time() - start_time}")
    return result


if __name__ == "__main__":
    device_name = "cuda:0" if torch.cuda.is_available() else "cpu"
    AutoConfig.register("dolphin", DolphinConfig)
    AutoModelForCausalLM.register(DolphinConfig, DolphinForCausalLM)
    # Load the tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained('NexaAIDev/Dolphin')
    model = AutoModelForCausalLM.from_pretrained('NexaAIDev/Dolphin', trust_remote_code=True, torch_dtype=torch.bfloat16, device_map=device_name)
    
    # Run inference example
    mycontext = "Nexa AI is a Cupertino-based company founded in May 2023 that researches and develops models and tools for on-device AI applications. The company is founded by Alex and Zack. The company is known for its Octopus-series models, which rival large-scale language models in capabilities such as function-calling, multimodality, and action-planning, while remaining efficient and compact for edge device deployment. Nexa AI's mission is to advance on-device AI in collaboration with the global developer community. To this end, the company has created an on-device model hub for users to find, share, and collaborate on open-source AI models optimized for edge devices, as well as an SDK for developers to run and deploy AI models locally"
    question = "Who founded Nexa AI?"
    # Pass the context and the correct device string
    result = inference_instruct(mycontext, question, device=device_name)
    print("Result:", result)
```

## Training Process
Dolphin's training involves three stages:
1. Restoration Training: Reconstructing original context from compressed embeddings
2. Continual Training: Generating context continuations from partial compressed contexts
3. Instruction Fine-tuning: Generating responses to queries given compressed contexts

This multi-stage approach progressively enhances the model's ability to handle long contexts and generate appropriate responses.
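The sketch below illustrates how the three stages can differ only in what the main decoder is asked to predict, while the compressed-context input stays the same. It is a rough illustration under stated assumptions (the helper name, the prompt layout, and the -1 placeholder convention borrowed from the inference example above), not the authors' training code.

```python
def build_training_example(stage, context_ids, continuation_ids=None,
                           query_ids=None, answer_ids=None, memory_size=32):
    """Illustrative only: build (decoder input ids, label ids) for one training stage.
    -1 marks positions where projected memory embeddings are spliced in."""
    memory_slots = [-1] * memory_size
    if stage == "restoration":
        # Stage 1: reconstruct the original context from its compressed embeddings.
        return memory_slots, context_ids
    if stage == "continual":
        # Stage 2: continue the text from a partially compressed context.
        return memory_slots, continuation_ids
    if stage == "instruction":
        # Stage 3: answer a query given the compressed context.
        return memory_slots + query_ids, answer_ids
    raise ValueError(f"unknown training stage: {stage}")
```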

## Citation
If you use Dolphin in your research, please cite our paper:

```bibtex
@article{dolphin2024,
  title={Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models},
  author={[Author Names]},
  journal={arXiv preprint arXiv:[paper_id]},
  year={2024}
}
```

## Contact
For questions or feedback, please [contact us](mailto:octopus@nexa4ai.com).