---
base_model: microsoft/phi-2
library_name: peft
license: mit
tags:
- text-generation
pipeline_tag: text-generation
datasets:
- NuclearAi/HyperThink-Mini-50K
---

# phi2-memory-deeptalks

A **LoRA adapter** for the Phi-2 language model, fine-tuned on short conversational snippets to provide **short-term memory** in dialogue. This adapter enables your assistant to recall and leverage the last few user/assistant turns without fully fine-tuning the 2.7B-parameter base model.

<p align="center">
  <a href="https://huggingface.co/spaces/sourize/DeepTalks">
    🔗 Live Demo on Hugging Face Spaces
  </a>
</p>
<p align="center">
  ⏳ It takes time to generate responses since it's running on the CPU free tier
</p>


---

## 🚀 Overview

**phi2-memory-deeptalks** injects lightweight, low-rank corrections into the attention and MLP layers of `microsoft/phi-2`.  
- **Size:** ~6 M trainable parameters (≈ 0.2 % of the base model)  
- **Base:** Phi-2 (2.7 B parameters)  
- **Adapter:** Low-Rank Adaptation (LoRA) via the [PEFT](https://github.com/huggingface/peft) library  

---

## 📦 Model Details

### Architecture & Adapter Configuration

- **Base model:** `microsoft/phi-2` (causal-LM)  
- **LoRA rank (r):** 4  
- **Modules wrapped:**  
  - Attention projections: `q_proj`, `k_proj`, `v_proj`, `dense`  
  - MLP layers: `fc1`, `fc2`  
- **LoRA hyperparameters:**  
  - `lora_alpha`: 32  
  - `lora_dropout`: 0.05  
  - **Trainable params:** ~5.9 M  
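
Below is a minimal PEFT sketch of how a configuration matching these values could be constructed. The original training script is not included in this repository, so treat the snippet as illustrative rather than the exact code used.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative LoRA configuration mirroring the values listed above;
# this is a reconstruction, not the original training script.
lora_config = LoraConfig(
    r=4,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"],
    bias="none",
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # expect roughly ~5.9M trainable parameters
```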

### Training Data & Preprocessing

- **Dataset:** `NuclearAi/HyperThink-Mini-50K` (~7% of the dataset used)  
- **Prompt format:**  
  ```text
  ### Human:
  <user message>

  ### Assistant:
  <assistant response>
  ```
- **Tokenization:** Truncated/padded to 256 tokens, `labels = input_ids`  
- **Optimizer:** AdamW (PyTorch), FP16 on GPU  
- **Batching:** `per_device_train_batch_size=1` + `gradient_accumulation_steps=8`  
- **Epochs:** 3  
- **Checkpointing:** Save every 500 steps; final adapter weights in `adapter_model.safetensors`  
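
As a rough sketch of the preprocessing and training setup described above (the exact script is not published; the `build_example` helper and the `output_dir` value are assumptions):

```python
from transformers import AutoTokenizer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
tokenizer.pad_token = tokenizer.eos_token  # phi-2 ships without a pad token

def build_example(user_msg: str, assistant_msg: str, max_length: int = 256):
    # Render one turn in the "### Human / ### Assistant" template and
    # tokenize with labels = input_ids for causal-LM training.
    text = f"### Human:\n{user_msg}\n\n### Assistant:\n{assistant_msg}"
    enc = tokenizer(text, truncation=True, padding="max_length", max_length=max_length)
    enc["labels"] = enc["input_ids"].copy()
    return enc

# Training arguments reflecting the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="phi2-memory-deeptalks",  # placeholder path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    fp16=True,                           # FP16 on GPU
    optim="adamw_torch",                 # AdamW (PyTorch)
    save_steps=500,                      # checkpoint every 500 steps
)
```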

---

## 🎯 Evaluation

- **Training loss (step 500):** ~1.08  
- **Validation loss:** ~1.10  
- **Qualitative:**  
  - Improved recall of the last 2–4 turns in dialogue  
  - Maintains base Phi-2 fluency on general language  

---

## 🔧 Usage

Load the adapter into your Phi-2 model with just a few lines:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# 1) Load base
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# 2) Apply the LoRA adapter
model = PeftModel.from_pretrained(model, "sourize/phi2-memory-deeptalks")

# 3) (Optional) Resize embeddings only if you have added new tokens to the tokenizer
# model.base_model.resize_token_embeddings(len(tokenizer))

# 4) Generate
prompt = "### Human:\nHello, how are you?\n\n### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
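
Since the adapter is tuned to recall only the most recent turns, the inference prompt should include that recent history explicitly. A minimal, hypothetical helper (the `build_prompt` name and the 4-turn window are illustrative, not part of this repository):

```python
def build_prompt(history, user_msg, max_turns=4):
    """Assemble a prompt from the last few (user, assistant) turns plus the new message."""
    parts = []
    for user, assistant in history[-max_turns:]:
        parts.append(f"### Human:\n{user}\n\n### Assistant:\n{assistant}\n")
    parts.append(f"### Human:\n{user_msg}\n\n### Assistant:")
    return "\n".join(parts)

# Example: two earlier turns plus the new user message
history = [("Hi!", "Hello! How can I help?"), ("My name is Ana.", "Nice to meet you, Ana!")]
prompt = build_prompt(history, "Do you remember my name?")
```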

---

## ⚙️ Inference & Deployment

- **Preferred:** GPU (NVIDIA CUDA) for sub-second latency  
- **CPU-only:** ~7–10 min per response (large model!)  
- **Hugging Face Inference API:**  
  ```bash
  curl -X POST \
    -H "Authorization: Bearer $HF_TOKEN" \
    -H "Content-Type: application/json" \
    https://api-inference.huggingface.co/pipeline/text-generation/sourize/phi2-memory-deeptalks \
    -d '{
      "inputs": "Hello, how are you?",
      "parameters": {
        "max_new_tokens": 64,
        "do_sample": true,
        "temperature": 0.7,
        "top_p": 0.9,
        "return_full_text": false
      }
    }'
  ```
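
For GPU inference, a half-precision load keeps latency low. A sketch (assumes `accelerate` is installed so that `device_map="auto"` is available):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model in fp16 and place it on the available GPU(s)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", padding_side="left")
base = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "sourize/phi2-memory-deeptalks")

prompt = "### Human:\nHello, how are you?\n\n### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs, max_new_tokens=64, do_sample=True, temperature=0.7, top_p=0.9
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```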

---

## 💡 Use Cases & Limitations

- **Ideal for:**  
  - Short back-and-forth chats (2–4 turns)  
  - Chatbots that need to “remember” very recent context  
- **Not suited for:**  
  - Long-term memory or document-level retrieval  
  - High-volume production on CPU (too slow)  

---

## 📖 Further Reading

- **Live Demo:** [DeepTalks Space](https://huggingface.co/spaces/sourize/DeepTalks)  
- **Blog post:** [DeepTalks Blog](https://sourish.xyz/thoughts/deeptalks-your-personal-ai-companion) 
- **PEFT & LoRA:** [PEFT GitHub](https://github.com/huggingface/peft) | [LoRA Paper](https://arxiv.org/abs/2106.09685)  

---

## 🔖 Citation

```bibtex
@misc{sourize_phi2_memory_deeptalks,
  title        = {phi2-memory-deeptalks: LoRA adapter for Phi-2 with short-term conversational memory},
  author       = {Sourish},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/sourize/phi2-memory-deeptalks}},
  license      = {MIT}
}
```

---

*Questions or feedback? Please open an issue on the [repository](https://huggingface.co/sourize/phi2-memory-deeptalks).*