---
language:
- ru
base_model: t-tech/T-lite-it-1.0
---
# T-lite-it-1.0

**🚨 T-lite is designed for further fine-tuning and is not intended as a ready-to-use conversational assistant. Users are advised to exercise caution and are responsible for any additional training and oversight required to ensure the model's responses meet acceptable ethical and safety standards. The responsibility for incorporating this model into industrial or commercial solutions lies entirely with those who choose to deploy it.**

## Description

T-lite-it-1.0 is built on the Qwen 2.5 model family and incorporates both continual pre-training and alignment techniques.

### 📚 Dataset

**Pre-training Stage 1:** 100B tokens of diverse Russian data from Common Crawl, books, code, and proprietary datasets, mixed with replayed English data (English is included because it is the primary language of the base model).

**Pre-training Stage 2:** 40B tokens, a mix of instruction and pre-training data.

**Supervised Fine-Tuning (SFT):** 1B tokens, a mix of diverse instruction data.

**Preference Tuning:** 1B tokens, training the model to be helpful.

## 📊 Benchmarks

| Benchmark | T-lite-it-1.0 | Qwen-2.5-7B-Instruct | GigaChat Pro 1.0.26.15 | RuAdapt-Qwen-7B-Instruct-v1 | gemma-2-9b-it |
|-----------|:-------------:|:--------------------:|:----------------------:|:---------------------------:|:-------------:|
| [MERA](https://mera.a-ai.ru) | **0.552** | 0.482 | 0.512 | 0.468 | 0.505 |
| [MaMuRaMu](https://mera.a-ai.ru/ru/tasks/22) | **0.775** | 0.711 | 0.77 | 0.7 | 0.724 |
| ruMMLU-PRO | **0.497** | 0.481 | - | 0.448 | 0.405 |
| ruGSM8K | **0.856** | 0.832 | 0.752 | 0.795 | 0.823 |
| ruMATH | **0.679** | 0.671 | 0.418 | 0.607 | 0.473 |
| ruMBPP | 0.693 | 0.685 | 0.412 | **0.696** | 0.63 |
| [ruCodeEval](https://mera.a-ai.ru/ru/tasks/23) | 0.082 / 0.168 / 0.226 | 0.025 / 0.071 / 0.098 | 0.056 / 0.068 / 0.073 | 0.018 / 0.064 / 0.11 | **0.215 / 0.494 / 0.561** |
| Arena-Hard-Ru | **64.38** | 54.29 | - | 52.77 | 47.83 |
| MT Bench Ru | 7.87 | 7.33 | **8.21** | 7.62 | 7.4 |
| Alpaca Eval Ru | **39.61** | 25.61 | 29.83 | 28.43 | 36.87 |

Bold marks the best score in each row. Detailed evaluation results can be found in our [Habr post](https://habr.com/ru/companies/tbank/articles/865582/).

## 👨‍💻 Examples of usage

### HF Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

torch.manual_seed(42)  # fix the seed for reproducible sampling

model_name = "t-tech/T-lite-it-1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",  # use the dtype stored in the checkpoint config
    device_map="auto"    # place weights on the available GPU(s)/CPU
)

prompt = "Напиши стих про машинное обучение"  # "Write a poem about machine learning"
messages = [
    # System prompt: "You are T-lite, a virtual assistant at T-Technologies.
    # Your task is to be a helpful conversational assistant."
    {"role": "system", "content": "Ты T-lite, виртуальный ассистент в Т-Технологии. Твоя задача - быть полезным диалоговым ассистентом."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=256
)
# Drop the prompt tokens so only the newly generated completion is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)
```

Output (a poem about machine learning, in Russian):
```
В мире данных, где цифры танцуют,
Машинное обученье — ведущий вальс.
Алгоритмы учатся, как дети,
На примерах, как на сказочных страницах.

Они ищут закономерности в потоках,
Как мудрецы в древних свитках.
С каждым шагом всё точнее предсказания,
Вот так, словно волшебство, оживает.

Обучаясь на ошибках, они растут,
Из простых моделей в сложные формы.
Каждый новый пример — как новая строка,
В книге знаний, что не знает конца.

Не бойтесь перемен, ведь это — путь,
Который ведёт к будущему, светлому и новому.
Машинное обученье — наш проводник,
В этом мире, где технологии царят.
```
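
For interactive use, the same `model`, `tokenizer`, and `model_inputs` from the snippet above can stream tokens as they are generated via transformers' `TextStreamer`. A minimal sketch; the sampling values mirror the recommended vLLM settings below and are an assumption for `model.generate`, not a separately tuned recipe:

```python
from transformers import TextStreamer

# Print tokens to stdout as they are produced, hiding the prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **model_inputs,
    max_new_tokens=256,
    do_sample=True,           # sample instead of greedy decoding
    temperature=0.7,
    top_p=0.8,
    top_k=70,
    repetition_penalty=1.05,
    streamer=streamer,        # stream decoded text while generating
)
```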

### vLLM Usage

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "t-tech/T-lite-it-1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, max_model_len=8192)
sampling_params = SamplingParams(
    temperature=0.7,
    repetition_penalty=1.05,
    top_p=0.8,
    top_k=70
)

prompt = "Напиши стих про машинное обучение"  # "Write a poem about machine learning"
messages = [
    # System prompt: "You are T-lite, a virtual assistant at T-Technologies.
    # Your task is to be a helpful conversational assistant."
    {"role": "system", "content": "Ты T-lite, виртуальный ассистент в Т-Технологии. Твоя задача - быть полезным диалоговым ассистентом."},
    {"role": "user", "content": prompt}
]

# apply_chat_template tokenizes by default, so this returns token ids
prompt_token_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
```
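
Beyond offline generation, vLLM can also expose the model through an OpenAI-compatible HTTP endpoint. The sketch below is an illustration rather than part of the original card: it assumes a recent vLLM with the `vllm serve` entry point, the default local port, and reuses the system prompt from the examples above.

```python
# First, start the server in a separate shell (assumes a recent vLLM):
#   vllm serve t-tech/T-lite-it-1.0 --max-model-len 8192
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # default vLLM server address
    api_key="EMPTY",                      # vLLM accepts any key unless one is configured
)

response = client.chat.completions.create(
    model="t-tech/T-lite-it-1.0",
    messages=[
        {"role": "system", "content": "Ты T-lite, виртуальный ассистент в Т-Технологии. Твоя задача - быть полезным диалоговым ассистентом."},
        {"role": "user", "content": "Напиши стих про машинное обучение"},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=256,
)
print(response.choices[0].message.content)
```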
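
### Fine-tuning starting point

Since the warning at the top stresses that T-lite is meant for further fine-tuning rather than out-of-the-box use, here is a minimal sketch of attaching LoRA adapters with the `peft` library. It is an illustration only, not the authors' recipe: the rank, alpha, and target modules are assumed values for Qwen-style attention blocks, and the dataset and training loop are deliberately left out.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "t-tech/T-lite-it-1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Hypothetical LoRA configuration: illustrative defaults, not a tuned recipe
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# From here, pass `model` to your preferred trainer with your own
# instruction data rendered through tokenizer.apply_chat_template.
```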