avans06 committed on
Commit
43d9f32
1 Parent(s): 40311b7

Fixed an issue where ALMA running on CPU caused AutoGPTQ to throw an "Exllama" error:

ValueError: Found modules on cpu/disk. Using Exllama or Exllamav2 backend requires all the modules to be on GPU.

https://github.com/huggingface/transformers/blob/main/docs/source/en/quantization.md#exllama
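
For context, the workaround this commit applies can be illustrated outside the project code. The following is a minimal, hedged sketch (the checkpoint name `TheBloke/ALMA-7B-GPTQ` is only an assumed placeholder for a 4-bit GPTQ ALMA model): when the model must run on CPU, the ExLlama kernels are disabled through the `quantization_config` carried by the model's `AutoConfig` before calling `from_pretrained`, which is the same approach the diff below takes and avoids the ValueError quoted above.

```python
# Minimal sketch of the failure mode and the workaround (model name is a placeholder).
import torch
import transformers

model_path = "TheBloke/ALMA-7B-GPTQ"  # assumed 4-bit GPTQ ALMA checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

config = transformers.AutoConfig.from_pretrained(model_path)
if device == "cpu":
    # Without this flag, AutoGPTQ tries the ExLlama kernels and raises:
    # "ValueError: Found modules on cpu/disk. Using Exllama or Exllamav2 backend
    #  requires all the modules to be on GPU."
    config.quantization_config["use_exllama"] = False

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    low_cpu_mem_usage=True,
    config=config,
)
```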

docs/translateModel.md CHANGED
@@ -16,7 +16,7 @@ The required VRAM is provided for reference and may not apply to everyone. If th
 
 ## M2M100
 
-2M100 is a multilingual translation model introduced by Facebook AI in October 2020. It supports arbitrary translation among 101 languages. The paper is titled "`Beyond English-Centric Multilingual Machine Translation`" ([arXiv:2010.11125](https://arxiv.org/abs/2010.11125)).
+M2M100 is a multilingual translation model introduced by Facebook AI in October 2020. It supports arbitrary translation among 101 languages. The paper is titled "`Beyond English-Centric Multilingual Machine Translation`" ([arXiv:2010.11125](https://arxiv.org/abs/2010.11125)).
 
 | Name | Parameters | Size | type/quantize | Required VRAM |
 |------|------------|------|---------------|---------------|
@@ -40,8 +40,8 @@ NLLB-200 is a multilingual translation model introduced by Meta AI in July 2022.
 |------|------------|------|---------------|---------------|
 | [facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 600M | 2.46 GB | float32 | ≈2.5 GB |
 | [facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 1.3B | 5.48 GB | float32 | ≈5.9 GB |
-| [facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 5.48 GB | float32 | 5.8 GB |
-| [facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 17.58 GB | float32 | 13.4 GB |
+| [facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 5.48 GB | float32 | 5.8 GB |
+| [facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 17.58 GB | float32 | 13.4 GB |
 
 ## NLLB-200-CTranslate2
 
src/translation/translationModel.py CHANGED
@@ -124,6 +124,19 @@ class TranslationModel:
             If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
         repetition_penalty (float, optional, defaults to 1.0)
             The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.
+
+        [transformers.GPTQConfig]
+        use_exllama (bool, optional):
+            Whether to use exllama backend. Defaults to True if unset. Only works with bits = 4.
+
+        [ExLlama]
+        ExLlama is a Python/C++/CUDA implementation of the Llama model that is designed for faster inference with 4-bit GPTQ weights (check out these benchmarks).
+        The ExLlama kernel is activated by default when you create a [GPTQConfig] object.
+        To boost inference speed even further, use the ExLlamaV2 kernels by configuring the exllama_config parameter.
+        The ExLlama kernels are only supported when the entire model is on the GPU.
+        If you're doing inference on a CPU with AutoGPTQ (version > 0.4.2), then you'll need to disable the ExLlama kernel.
+        This overwrites the attributes related to the ExLlama kernels in the quantization config of the config.json file.
+        https://github.com/huggingface/transformers/blob/main/docs/source/en/quantization.md#exllama
         """
         try:
             print('\n\nLoading model: %s\n\n' % self.modelPath)
@@ -148,7 +161,13 @@ class TranslationModel:
             elif "ALMA" in self.modelPath:
                 self.ALMAPrefix = "Translate this from " + self.whisperLang.whisper.names[0] + " to " + self.translationLang.whisper.names[0] + ":\n" + self.whisperLang.whisper.names[0] + ": "
                 self.transTokenizer = transformers.AutoTokenizer.from_pretrained(self.modelPath, use_fast=True)
-                self.transModel = transformers.AutoModelForCausalLM.from_pretrained(self.modelPath, device_map="auto", low_cpu_mem_usage=True, trust_remote_code=False, revision=self.modelConfig.revision)
+                transModelConfig = transformers.AutoConfig.from_pretrained(self.modelPath)
+                if self.device == "cpu":
+                    transModelConfig.quantization_config["use_exllama"] = False
+                    self.transModel = transformers.AutoModelForCausalLM.from_pretrained(self.modelPath, device_map="auto", low_cpu_mem_usage=True, trust_remote_code=False, revision=self.modelConfig.revision, config=transModelConfig)
+                else:
+                    # transModelConfig.quantization_config["exllama_config"] = {"version":2} # After configuring to use ExLlamaV2, VRAM cannot be effectively released, which may be an issue. Temporarily not adopting the V2 version.
+                    self.transModel = transformers.AutoModelForCausalLM.from_pretrained(self.modelPath, device_map="auto", low_cpu_mem_usage=True, trust_remote_code=False, revision=self.modelConfig.revision)
                 self.transTranslator = transformers.pipeline("text-generation", model=self.transModel, tokenizer=self.transTokenizer, do_sample=True, temperature=0.7, top_k=40, top_p=0.95, repetition_penalty=1.1)
             else:
                 self.transTokenizer = transformers.AutoTokenizer.from_pretrained(self.modelPath)
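
As a usage note, the transformers quantization guide linked in the new docstring exposes the same switches through `GPTQConfig` rather than by editing `AutoConfig.quantization_config` directly. A hedged sketch of that alternative route (the checkpoint name is again an assumed placeholder, not a model this repo ships):

```python
import transformers

model_path = "TheBloke/ALMA-7B-GPTQ"  # assumed placeholder for a 4-bit GPTQ checkpoint

# CPU inference: turn the ExLlama kernels off via GPTQConfig.
cpu_model = transformers.AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cpu",
    quantization_config=transformers.GPTQConfig(bits=4, use_exllama=False),
)

# GPU inference: opt in to the ExLlamaV2 kernels. The commit leaves the equivalent
# line commented out in translationModel.py because of the VRAM-release issue noted there.
gpu_model = transformers.AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    quantization_config=transformers.GPTQConfig(bits=4, exllama_config={"version": 2}),
)
```

Either route overwrites the ExLlama-related attributes stored in the checkpoint's config.json, which is the behavior the docstring addition describes.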