avans06 committed on
Commit
43d9f32
1 Parent(s): 40311b7

Fixed an issue where ALMA running on CPU caused AutoGPTQ to throw an "Exllama" error:

ValueError: Found modules on cpu/disk. Using Exllama or Exllamav2 backend requires all the modules to be on GPU.

https://github.com/huggingface/transformers/blob/main/docs/source/en/quantization.md#exllama
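
For context, the workaround this commit applies can be illustrated outside the project code. The following is a minimal, hedged sketch (the checkpoint name `TheBloke/ALMA-7B-GPTQ` is only an assumed placeholder for a 4-bit GPTQ ALMA model): when the model must run on CPU, the ExLlama kernels are disabled through the `quantization_config` carried by the model's `AutoConfig` before calling `from_pretrained`, which is the same approach the diff below takes and avoids the ValueError quoted above.

```python
# Minimal sketch of the failure mode and the workaround (model name is a placeholder).
import torch
import transformers

model_path = "TheBloke/ALMA-7B-GPTQ"  # assumed 4-bit GPTQ ALMA checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

config = transformers.AutoConfig.from_pretrained(model_path)
if device == "cpu":
    # Without this flag, AutoGPTQ tries the ExLlama kernels and raises:
    # "ValueError: Found modules on cpu/disk. Using Exllama or Exllamav2 backend
    #  requires all the modules to be on GPU."
    config.quantization_config["use_exllama"] = False

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    low_cpu_mem_usage=True,
    config=config,
)
```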

docs/translateModel.md CHANGED
@@ -16,7 +16,7 @@ The required VRAM is provided for reference and may not apply to everyone. If th
 
 ## M2M100
 
-2M100 is a multilingual translation model introduced by Facebook AI in October 2020. It supports arbitrary translation among 101 languages. The paper is titled "`Beyond English-Centric Multilingual Machine Translation`" ([arXiv:2010.11125](https://arxiv.org/abs/2010.11125)).
+M2M100 is a multilingual translation model introduced by Facebook AI in October 2020. It supports arbitrary translation among 101 languages. The paper is titled "`Beyond English-Centric Multilingual Machine Translation`" ([arXiv:2010.11125](https://arxiv.org/abs/2010.11125)).
 
 | Name | Parameters | Size | type/quantize | Required VRAM |
 |------|------------|------|---------------|---------------|
@@ -40,8 +40,8 @@ NLLB-200 is a multilingual translation model introduced by Meta AI in July 2022.
 |------|------------|------|---------------|---------------|
 | [facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 600M | 2.46 GB | float32 | ≈2.5 GB |
 | [facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 1.3B | 5.48 GB | float32 | ≈5.9 GB |
-| [facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 5.48 GB | float32 | 5.8 GB |
-| [facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 17.58 GB | float32 | 13.4 GB |
+| [facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 5.48 GB | float32 | 5.8 GB |
+| [facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 17.58 GB | float32 | 13.4 GB |
 
 ## NLLB-200-CTranslate2
 
src/translation/translationModel.py CHANGED
@@ -124,6 +124,19 @@ class TranslationModel:
             If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
         repetition_penalty (float, optional, defaults to 1.0)
             The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.
+
+        [transformers.GPTQConfig]
+        use_exllama (bool, optional):
+            Whether to use exllama backend. Defaults to True if unset. Only works with bits = 4.
+
+        [ExLlama]
+        ExLlama is a Python/C++/CUDA implementation of the Llama model that is designed for faster inference with 4-bit GPTQ weights (check out these benchmarks).
+        The ExLlama kernel is activated by default when you create a [GPTQConfig] object.
+        To boost inference speed even further, use the ExLlamaV2 kernels by configuring the exllama_config parameter.
+        The ExLlama kernels are only supported when the entire model is on the GPU.
+        If you're doing inference on a CPU with AutoGPTQ (version > 0.4.2), then you'll need to disable the ExLlama kernel.
+        This overwrites the attributes related to the ExLlama kernels in the quantization config of the config.json file.
+        https://github.com/huggingface/transformers/blob/main/docs/source/en/quantization.md#exllama
         """
         try:
             print('\n\nLoading model: %s\n\n' % self.modelPath)
@@ -148,7 +161,13 @@ class TranslationModel:
             elif "ALMA" in self.modelPath:
                 self.ALMAPrefix = "Translate this from " + self.whisperLang.whisper.names[0] + " to " + self.translationLang.whisper.names[0] + ":\n" + self.whisperLang.whisper.names[0] + ": "
                 self.transTokenizer = transformers.AutoTokenizer.from_pretrained(self.modelPath, use_fast=True)
-                self.transModel = transformers.AutoModelForCausalLM.from_pretrained(self.modelPath, device_map="auto", low_cpu_mem_usage=True, trust_remote_code=False, revision=self.modelConfig.revision)
+                transModelConfig = transformers.AutoConfig.from_pretrained(self.modelPath)
+                if self.device == "cpu":
+                    transModelConfig.quantization_config["use_exllama"] = False
+                    self.transModel = transformers.AutoModelForCausalLM.from_pretrained(self.modelPath, device_map="auto", low_cpu_mem_usage=True, trust_remote_code=False, revision=self.modelConfig.revision, config=transModelConfig)
+                else:
+                    # transModelConfig.quantization_config["exllama_config"] = {"version":2} # After configuring to use ExLlamaV2, VRAM cannot be effectively released, which may be an issue. Temporarily not adopting the V2 version.
+                    self.transModel = transformers.AutoModelForCausalLM.from_pretrained(self.modelPath, device_map="auto", low_cpu_mem_usage=True, trust_remote_code=False, revision=self.modelConfig.revision)
                 self.transTranslator = transformers.pipeline("text-generation", model=self.transModel, tokenizer=self.transTokenizer, do_sample=True, temperature=0.7, top_k=40, top_p=0.95, repetition_penalty=1.1)
             else:
                 self.transTokenizer = transformers.AutoTokenizer.from_pretrained(self.modelPath)
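
As a usage note, the transformers quantization guide linked in the new docstring exposes the same switches through `GPTQConfig` rather than by editing `AutoConfig.quantization_config` directly. A hedged sketch of that alternative route (the checkpoint name is again an assumed placeholder, not a model this repo ships):

```python
import transformers

model_path = "TheBloke/ALMA-7B-GPTQ"  # assumed placeholder for a 4-bit GPTQ checkpoint

# CPU inference: turn the ExLlama kernels off via GPTQConfig.
cpu_model = transformers.AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cpu",
    quantization_config=transformers.GPTQConfig(bits=4, use_exllama=False),
)

# GPU inference: opt in to the ExLlamaV2 kernels. The commit leaves the equivalent
# line commented out in translationModel.py because of the VRAM-release issue noted there.
gpu_model = transformers.AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    quantization_config=transformers.GPTQConfig(bits=4, exllama_config={"version": 2}),
)
```

Either route overwrites the ExLlama-related attributes stored in the checkpoint's config.json, which is the behavior the docstring addition describes.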