Loading any TheBloke GGUF model with CTransformers from LangChain results in the maximum context length being limited to 512

#1
by KrangPhD

2024-05-28 13:51:08,540 - INFO - Loading LLM TheBloke/LLaMA-Pro-8B-GGUF using mode = GGUF
2024-05-28 13:51:08,835 - INFO - Model's Default Context Length: 2048
2024-05-28 13:51:08,835 - INFO - Using context length: 2048
Fetching 1 files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<?, ?it/s]
Fetching 1 files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<?, ?it/s]
2024-05-28 13:52:44,008 - INFO - Loading faiss with AVX2 support.
2024-05-28 13:52:44,085 - INFO - Successfully loaded faiss with AVX2 support.
2024-05-28 13:52:53,592 - INFO - Managing returned Sources ... mode = GGUF
2024-05-28 13:52:53,603 - WARNING - Number of tokens (1155) exceeded maximum context length (512).

The model is loaded using:

        try:
            logging.info(f"Loading LLM {self.llm_model} using mode = {self.mode}")
            config = transformers.AutoConfig.from_pretrained(self.llm_model)
            model_context_length = getattr(config, "max_position_embeddings", None)
            if model_context_length:
                logging.info(f"Model's Default Context Length: {model_context_length}")
                # Ensure context_length is within the model's maximum context length
                context_length = min(4096, model_context_length)
                logging.info(f"Using context length: {context_length}")
            else:
                context_length = 2048  # Fallback if not specified

            model = CTransformers(
                model=self.llm_model,
                batch_size=52,
                max_new_tokens=1024,
                context_length=context_length,
                gpu_layers=0
            )
            return model
        except OSError as e:
            logging.error(f"Error loading {self.llm_model} model: {e}")
        except Exception as e:
            logging.error(f"Unexpected error loading {self.llm_model} model: {e}")

The model is then called using:

        generated_text = self.loaded_model.invoke(
            formatted_prompt,
            **generate_kwargs,
            do_sample=True,
            stream=True,
            details=True,
            return_full_text=False,
        )
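To narrow down the 512 warning, here is a rough pre-flight size check on `formatted_prompt` before the invoke call. The ~4 characters per token ratio is only a crude heuristic, and the 2048/1024 values are simply the numbers requested at load time above:

    import logging

    # Rough pre-flight check before invoking (heuristic only: ~4 chars per token).
    # If the estimate fits in the requested window but the backend still warns at 512,
    # the requested context_length probably never reached it.
    requested_context = 2048      # context_length requested when loading
    reserved_for_output = 1024    # max_new_tokens requested when loading
    approx_prompt_tokens = len(formatted_prompt) // 4

    if approx_prompt_tokens + reserved_for_output > requested_context:
        logging.warning(
            "Prompt (~%d tokens) plus max_new_tokens (%d) may not fit in the "
            "requested context length (%d).",
            approx_prompt_tokens, reserved_for_output, requested_context,
        )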
