Files changed (3)
  1. README.md +0 -6
  2. modeling_chatglm.py +8 -21
  3. tokenization_chatglm.py +1 -1
README.md CHANGED
@@ -16,13 +16,7 @@ tags:
 👋 Join our <a href="https://join.slack.com/t/chatglm/shared_invite/zt-1y7pqoloy-9b1g6T6JjA8J0KxvUjbwJw" target="_blank">Slack</a> and <a href="https://github.com/THUDM/ChatGLM-6B/blob/main/resources/WECHAT.md" target="_blank">WeChat</a>
 </p>
 
-## 更新/Update
-
-- 我们优化了KV Cache的存储方式,减少了显存碎片的产生。基于优化后的代码,模型可以在约**20G显存**的情况下处理32K长度的上下文(FP/BF16格式)。
-- We have optimized the storage method of the KV Cache, reducing the generation of memory fragmentation. Based on the optimized code, the model can process a context length of 32K under approximately **20G** of memory (FP/BF16 format).
-
 ## 介绍
-
 ChatGLM**2**-6B-32K在[ChatGLM2-6B](https://huggingface.co/THUDM/chatglm2-6b)的基础上进一步强化了对于长文本的理解能力,能够更好的处理最多32K长度的上下文。具体地,我们基于[位置插值](https://arxiv.org/abs/2306.15595)(Positional Interpolation)的方法对位置编码进行了更新,并在对话阶段使用 32K 的上下文长度训练。在实际的使用中,如果您面临的上下文长度基本在 **8K 以内**,我们推荐使用[ChatGLM2-6B](https://huggingface.co/THUDM/chatglm2-6b);如果您需要处理**超过 8K** 的上下文长度,我们推荐使用ChatGLM2-6B-32K。
 
 ChatGLM**2**-6B-32K是开源中英双语对话模型 [ChatGLM2-6B](https://github.com/THUDM/ChatGLM2-6B) 的加长版本,在保留了初代模型对话流畅、部署门槛较低等众多优秀特性的基础之上,ChatGLM**2**-6B-32k 引入了如下新特性:
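In short, the README text kept by this hunk says that ChatGLM2-6B-32K strengthens ChatGLM2-6B's long-text understanding: position encodings are updated with Positional Interpolation and the dialogue stage is trained at a 32K context length, so ChatGLM2-6B is recommended for contexts within roughly 8K and the 32K variant beyond that. A minimal long-context usage sketch, assuming the usual ChatGLM2 `chat()` helper exposed by the remote code and the published `THUDM/chatglm2-6b-32k` checkpoint; dtype and device choices are illustrative only:

```python
from transformers import AutoTokenizer, AutoModel

# Illustrative only: half precision on a single GPU; adjust to your hardware.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b-32k", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm2-6b-32k", trust_remote_code=True).half().cuda()
model = model.eval()

long_document = "..."  # up to ~32K tokens of context
response, history = model.chat(tokenizer, "Summarize the following document:\n" + long_document, history=[])
print(response)
```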
modeling_chatglm.py CHANGED
@@ -413,10 +413,7 @@ class SelfAttention(torch.nn.Module):
             key_layer = torch.cat((cache_k, key_layer), dim=0)
             value_layer = torch.cat((cache_v, value_layer), dim=0)
         if use_cache:
-            if kv_cache is None:
-                kv_cache = torch.cat((key_layer.unsqueeze(0).unsqueeze(0), value_layer.unsqueeze(0).unsqueeze(0)), dim=1)
-            else:
-                kv_cache = (key_layer, value_layer)
+            kv_cache = (key_layer, value_layer)
         else:
             kv_cache = None
 
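The hunk above replaces the prefill-only stacked-tensor cache with a single per-layer `(key_layer, value_layer)` tuple. A minimal sketch of how such a per-layer cache grows, assuming the `[seq_len, batch, heads, head_dim]` layout implied by the `dim=0` concatenation; all sizes are made up:

```python
import torch

# Sketch of the per-layer tuple cache with made-up shapes; the cache grows
# along dim=0 (the sequence dimension), matching the torch.cat calls above.
seq, batch, heads, head_dim = 8, 1, 2, 4
kv_cache = None

for step in range(3):
    # New K/V for the current token(s): prefill produces `seq` positions,
    # each later decoding step produces one.
    new_len = seq if kv_cache is None else 1
    key_layer = torch.randn(new_len, batch, heads, head_dim)
    value_layer = torch.randn(new_len, batch, heads, head_dim)

    if kv_cache is not None:
        cache_k, cache_v = kv_cache
        key_layer = torch.cat((cache_k, key_layer), dim=0)
        value_layer = torch.cat((cache_v, value_layer), dim=0)

    kv_cache = (key_layer, value_layer)  # tuple format, one pair per layer

print(kv_cache[0].shape)  # torch.Size([10, 1, 2, 4]) after prefill + 2 steps
```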
 
@@ -615,8 +612,12 @@ class GLMTransformer(torch.nn.Module):
         if not kv_caches:
             kv_caches = [None for _ in range(self.num_layers)]
         presents = () if use_cache else None
-        if self.training:
-            use_cache = False
+        if self.gradient_checkpointing and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                )
+                use_cache = False
 
         all_self_attentions = None
         all_hidden_states = () if output_hidden_states else None
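This hunk restores the stock guard that turns off KV caching only while gradient checkpointing is active during training: checkpointed blocks are re-executed in the backward pass, so any cache they returned could not be reused. A sketch of the guard in isolation, assuming the module-level `logger` that modeling_chatglm.py already creates from transformers' logging utilities (which provide `warning_once`):

```python
from transformers.utils import logging

logger = logging.get_logger(__name__)

def resolve_use_cache(use_cache: bool, gradient_checkpointing: bool, training: bool) -> bool:
    """Caching keeps K/V from the forward pass, but checkpointed blocks are
    recomputed during backward, so the two features cannot be combined."""
    if gradient_checkpointing and training and use_cache:
        logger.warning_once(
            "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
        )
        return False
    return use_cache

print(resolve_use_cache(True, gradient_checkpointing=True, training=True))  # False, with a one-time warning
```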
@@ -644,15 +645,7 @@ class GLMTransformer(torch.nn.Module):
                 )
             hidden_states, kv_cache = layer_ret
             if use_cache:
-                # token by token decoding, use tuple format
-                if kv_caches[0] is not None:
-                    presents = presents + (kv_cache,)
-                # prefilling in decoding, use tensor format to save cuda memory
-                else:
-                    if len(presents) == 0:
-                        presents = kv_cache
-                    else:
-                        presents = torch.cat((presents, kv_cache), dim=0)
+                presents = presents + (kv_cache,)
 
             if output_hidden_states:
                 all_hidden_states = all_hidden_states + (hidden_states,)
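With the tensor-format branch removed, `presents` is always a tuple holding one `(key, value)` pair per layer, i.e. the usual `past_key_values` structure a generation loop feeds back in. A sketch with made-up layer count and shapes:

```python
import torch

# Hypothetical dimensions purely for illustration.
num_layers, seq, batch, heads, head_dim = 3, 5, 1, 2, 4

# Prefill: every layer appends its (key, value) pair, so `presents` ends up
# as a tuple of num_layers tuples.
presents = ()
for _ in range(num_layers):
    kv_cache = (torch.randn(seq, batch, heads, head_dim),
                torch.randn(seq, batch, heads, head_dim))
    presents = presents + (kv_cache,)

# Next decoding step: each layer receives its own pair back and extends it
# along the sequence dimension, as in the SelfAttention hunk above.
past_key_values = presents
next_presents = ()
for cache_k, cache_v in past_key_values:
    key_layer = torch.cat((cache_k, torch.randn(1, batch, heads, head_dim)), dim=0)
    value_layer = torch.cat((cache_v, torch.randn(1, batch, heads, head_dim)), dim=0)
    next_presents = next_presents + ((key_layer, value_layer),)

print(len(next_presents), next_presents[0][0].shape)  # 3 torch.Size([6, 1, 2, 4])
```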
@@ -837,12 +830,6 @@ class ChatGLMModel(ChatGLMPreTrainedModel):
             inputs_embeds, full_attention_mask, rotary_pos_emb=rotary_pos_emb,
             kv_caches=past_key_values, use_cache=use_cache, output_hidden_states=output_hidden_states
         )
-        if presents is not None and type(presents) is torch.Tensor:
-            presents = presents.split(1, dim=0)
-            presents = list(presents)
-            presents = [list(x.squeeze(0).split(1, dim=0)) for x in presents]
-            presents = [tuple([x.squeeze(0) for x in y]) for y in presents]
-            presents = tuple(presents)
 
         if not return_dict:
             return tuple(v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None)
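The last hunk drops the unpacking step in `ChatGLMModel.forward` that converted the stacked prefill cache back into per-layer tuples; with the tuple format used throughout, there is nothing left to convert. For reference, a sketch of what that removed split/squeeze chain did, using made-up sizes and the `[num_layers, 2, seq, batch, heads, head_dim]` layout implied by the removed concatenation code:

```python
import torch

# Made-up sizes; keys and values are stacked per layer in one tensor.
num_layers, seq, batch, heads, head_dim = 3, 5, 1, 2, 4
stacked = torch.randn(num_layers, 2, seq, batch, heads, head_dim)

# Equivalent of the removed split/squeeze chain: back to a tuple of
# per-layer (key, value) pairs.
presents = stacked.split(1, dim=0)                                  # num_layers chunks
presents = [list(x.squeeze(0).split(1, dim=0)) for x in presents]   # split keys from values
presents = tuple(tuple(x.squeeze(0) for x in y) for y in presents)

print(len(presents), presents[0][0].shape)  # 3 torch.Size([5, 1, 2, 4])
```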
 
tokenization_chatglm.py CHANGED
@@ -66,6 +66,7 @@ class ChatGLMTokenizer(PreTrainedTokenizer):
     model_input_names = ["input_ids", "attention_mask", "position_ids"]
 
     def __init__(self, vocab_file, padding_side="left", clean_up_tokenization_spaces=False, **kwargs):
+        super().__init__(padding_side=padding_side, clean_up_tokenization_spaces=clean_up_tokenization_spaces, **kwargs)
         self.name = "GLMTokenizer"
 
         self.vocab_file = vocab_file
@@ -75,7 +76,6 @@ class ChatGLMTokenizer(PreTrainedTokenizer):
             "<eos>": self.tokenizer.eos_id,
             "<pad>": self.tokenizer.pad_id
         }
-        super().__init__(padding_side=padding_side, clean_up_tokenization_spaces=clean_up_tokenization_spaces, **kwargs)
 
     def get_command(self, token):
         if token in self.special_tokens:
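The tokenizer change moves `super().__init__` to the top of `__init__`, so `padding_side` and `clean_up_tokenization_spaces` are handled by the Hugging Face base class before the GLM-specific attributes are set. A hedged check of the visible effect, relying only on generic `PreTrainedTokenizer` behaviour and the published checkpoint id:

```python
from transformers import AutoTokenizer

# The values passed to super().__init__ become attributes on the loaded tokenizer.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b-32k", trust_remote_code=True)
print(tokenizer.padding_side)                  # "left", the __init__ default above
print(tokenizer.clean_up_tokenization_spaces)  # False
print(tokenizer("Hello, ChatGLM2-6B-32K")["input_ids"][:4])
```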
 