---
license: cc-by-nc-4.0
base_model: Qwen/Qwen2-7B-Instruct
model-index:
- name: Dolphin 
  results: []
tags:
- RAG
- on-device language model
- Retrieval Augmented Generation
inference: false
space: false
spaces: false
language:
- en
---
# Dolphin: Long Context as a New Modality for On-Device RAG

<p align="center">
- <a href="https://www.nexaai.com/models" target="_blank">Nexa Model Hub</a>
- <a href="https://arxiv.org/pdf/2408.15518" target="_blank">ArXiv</a>
</p>

<p align="center" width="100%">
  <a><img src="logo.png" alt="nexa-octopus" style="width: 30%; min-width: 300px; display: block; margin: auto;"></a>
</p>

## Overview
Dolphin is a novel approach that accelerates language model inference by treating long context as a new modality, much like the image, audio, and video modalities in vision-language models. It incorporates a language encoder model to compress context information into embeddings, applying multimodal-model concepts to make language model inference more efficient. Model highlights:
- 🧠 Context as a distinct modality
- 🗜️ Language encoder for context compression
- 🔗 Multimodal techniques applied to language processing
- ⚡ Optimized for energy efficiency and on-device use
- 📜 Specialized for long context understanding

## Model Architecture
Dolphin employs a decoder-decoder framework with two main components:
1. A smaller decoder (0.5B parameters) that compresses information from extensive contexts into memory embeddings
2. A larger decoder (7B parameters) that comprehends the current query and generates the response

A projector aligns the embeddings between the context encoder and the main decoder; a minimal sketch of this flow follows the figure below.

![Model Architecture](modelstructure.jpg)
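For intuition, here is a minimal PyTorch sketch of the decoder-decoder flow. It is an illustration, not the released implementation: `ContextEncoderLike`, `projector`, and `main_decoder` are hypothetical stand-ins, and the memory handling mirrors the `-1` placeholder ids used in the inference example below.

```python
# Minimal sketch of the decoder-decoder flow (assumptions, not the released code).
import torch
import torch.nn as nn

MEMORY_SIZE = 32  # memory tokens summarizing the long context


class DolphinSketch(nn.Module):
    def __init__(self, context_encoder: nn.Module, projector: nn.Module, main_decoder: nn.Module):
        super().__init__()
        self.context_encoder = context_encoder  # small 0.5B decoder used as a context encoder
        self.projector = projector              # aligns encoder hidden size with the 7B decoder
        self.main_decoder = main_decoder        # large 7B decoder that answers the query

    def forward(self, input_ids, context_input_ids, context_attention_mask):
        # 1) Encode (context + [memory_0..31]) and keep the memory-token hidden states.
        ctx = self.context_encoder(
            input_ids=context_input_ids, attention_mask=context_attention_mask
        ).last_hidden_state
        memory = self.projector(ctx[:, -MEMORY_SIZE:, :])  # (batch, 32, hidden)

        # 2) Replace the -1 placeholder ids in the query with the projected memory embeddings.
        tok_emb = self.main_decoder.get_input_embeddings()(input_ids.clamp(min=0))
        tok_emb[input_ids == -1] = memory.reshape(-1, memory.size(-1)).to(tok_emb.dtype)

        # 3) The large decoder attends over [query prefix | compressed context | question].
        return self.main_decoder(inputs_embeds=tok_emb)
```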

## Running the Model
### Method 1
Download this repository and run the following commands:
```bash
git lfs install
git clone https://huggingface.co/NexaAIDev/Dolphin
python inference_example.py
```

### Method 2
Install the `nexaai-dolphin` package:
```bash
pip install nexaai-dolphin
```

Then run the following Python script:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
import torch
from dolphin.configuration_dolphin import DolphinConfig
from dolphin.modeling_dolphin import DolphinForCausalLM


def inference_instruct(mycontext, question, device="cuda:0"):
    import time
    MEMORY_SIZE = 32  # number of memory tokens reserved for the compressed context
    start_time = time.time()
    generated_token_ids = []
    prompt = f" <context>{question}"
    text_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split("<context>")]
    input_ids = (
        torch.tensor(
            text_chunks[0] + [-1] * MEMORY_SIZE + text_chunks[1], dtype=torch.long
        )
        .unsqueeze(0)
        .to(device)
    )
    # Append the [memory_i] tokens to the long context so the encoder can compress it.
    context_tokenized = tokenizer(
        mycontext + "".join([f"[memory_{i}]" for i in range(MEMORY_SIZE)]),
        return_tensors="pt",
    )
    context_tokenized = {k: v.to(device) for k, v in context_tokenized.items()}
    context_token_count = (context_tokenized["input_ids"]).shape[1] - MEMORY_SIZE
    # Greedy decoding, capped at the number of raw context tokens.
    for i in range(context_token_count):
        next_token = (
            model(
                input_ids,
                context_input_ids=context_tokenized["input_ids"],
                context_attention_mask=context_tokenized["attention_mask"],
            )
            .logits[:, -1]
            .argmax(-1)
        )
        if next_token.item() == 151643:  # Qwen2 end-of-text token, stop generation
            break
        generated_token_ids.append(next_token.item())
        input_ids = torch.cat([input_ids, next_token.unsqueeze(1)], dim=-1)
    result = tokenizer.decode(generated_token_ids)
    print(f"Time taken: {time.time() - start_time}")
    return result


if __name__ == "__main__":
    device_name = "cuda:0" if torch.cuda.is_available() else "cpu"
    AutoConfig.register("dolphin", DolphinConfig)
    AutoModelForCausalLM.register(DolphinConfig, DolphinForCausalLM)
    tokenizer = AutoTokenizer.from_pretrained('NexaAIDev/Dolphin')
    model = AutoModelForCausalLM.from_pretrained('NexaAIDev/Dolphin', trust_remote_code=True, torch_dtype=torch.bfloat16, device_map=device_name)
    
    # Run inference example
    mycontext = "Nexa AI is a Cupertino-based company founded in May 2023 that researches and develops models and tools for on-device AI applications. The company is founded by Alex and Zack. The company is known for its Octopus-series models, which rival large-scale language models in capabilities such as function-calling, multimodality, and action-planning, while remaining efficient and compact for edge device deployment. Nexa AI's mission is to advance on-device AI in collaboration with the global developer community. To this end, the company has created an on-device model hub for users to find, share, and collaborate on open-source AI models optimized for edge devices, as well as an SDK for developers to run and deploy AI models locally"
    question = "Who founded Nexa AI?"
    result = inference_instruct(mycontext, question, device=device_name)
    print("Result:", result)
```

## Training Process
Dolphin's training involves three stages:
1. Restoration Training: Reconstructing original context from compressed embeddings
2. Continual Training: Generating context continuations from partial compressed contexts
3. Instruction Fine-tuning: Generating responses to queries given compressed contexts

This multi-stage approach progressively enhances the model's ability to handle long contexts and generate appropriate responses.
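As an illustration only (hedged assumptions, not the released training code), the three stages can be thought of as producing different (context, prompt, target) triples from the same raw text. The hypothetical helpers below show one way such examples could be assembled, with `[memory_i]` tokens standing in for the compressed context embeddings.

```python
# Hypothetical sketch of how training examples for the three stages could be built.
MEMORY = "".join(f"[memory_{i}]" for i in range(32))


def restoration_example(context: str):
    # Stage 1: reconstruct the original context from its compressed embeddings.
    return {"context": context + MEMORY, "prompt": "", "target": context}


def continual_example(context: str, split: float = 0.5):
    # Stage 2: given a compressed partial context, generate its continuation.
    cut = int(len(context) * split)
    return {"context": context[:cut] + MEMORY, "prompt": "", "target": context[cut:]}


def instruction_example(context: str, question: str, answer: str):
    # Stage 3: answer a query conditioned on the compressed context.
    return {"context": context + MEMORY, "prompt": question, "target": answer}
```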

## Citation
If you use Dolphin in your research, please cite our paper:

```bibtex
@article{chen2024dolphinlongcontextnew,
      title={Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models}, 
      author={Wei Chen and Zhiyuan Li and Shuo Xin and Yihao Wang},
      year={2024},
      eprint={2408.15518},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.15518}, 
}
```

## Contact
For questions or feedback, please [contact us](mailto:octopus@nexa4ai.com).