Try the multimodal version with the free Colab T4 demo.


This repository contains the Guanaco model with 4-bit quantized weights. The quantization relies on two techniques introduced by GPTQ: quantizing weight columns in order of decreasing activation size, and quantizing sequentially within a single Transformer block. Together, these make a compact, consumer-grade multilingual model practical.
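As a minimal sketch of how such 4-bit weights can be loaded, assuming they are exported in a GPTQ format readable by the AutoGPTQ library; the repository id and prompt template below are placeholders, not values taken from this repository:

```python
# Sketch: load 4-bit GPTQ weights with AutoGPTQ (an assumption about the
# export format) and run a short generation.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

repo = "your-namespace/guanaco-4bit"  # placeholder, not this repository's actual id
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoGPTQForCausalLM.from_quantized(repo, device="cuda:0")

# Illustrative prompt only; check the model card for the exact template.
prompt = "User: What is the capital of France?\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```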

Guanaco aims to be a minimal multilingual conversational model: one that handles simple Q&A interactions with a solid grasp of grammar, a rich vocabulary, and stability comparable to that of large-scale language models, so that it can serve as a human-computer interface.

However, consumer hardware cannot run models at the performance level of GPT-3.5 or GPT-4 locally. Our model, with far fewer parameters, still runs on older hardware generations and requires less than 6 GB of memory after 4-bit quantization. The only constraint is inference speed, which depends on the actual hardware configuration.
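A back-of-envelope check of the memory claim, assuming a 7B-parameter base model (an assumption; the parameter count is not stated above):

```python
# Rough memory estimate for 4-bit quantized weights.
params = 7e9          # assumed parameter count; not stated in the model card
bits_per_weight = 4   # 4-bit quantization
weights_gib = params * bits_per_weight / 8 / 1024**3
print(f"weights alone: ~{weights_gib:.1f} GiB")  # ~3.3 GiB
# Quantization scales/zero-points, activations, and the KV cache add overhead,
# which is why the practical footprint lands under 6 GB rather than at ~3.3 GiB.
```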

Instead of competing with large models like ChatGPT, we pursue a different approach: a functionally complete language model that is not expected to carry inherent knowledge or computational ability. We integrate external APIs for knowledge acquisition (e.g., querying online resources such as Wikipedia, or delegating calculations to Wolfram|Alpha) to deliver accurate information to users, rather than relying on what the model has memorized. The primary goal is a stable language model for human-computer interaction.
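The sketch below illustrates this API-integration idea under stated assumptions: facts are fetched from Wikipedia's public REST summary endpoint, and the final answer-phrasing step (a call to the local model) is left as an illustrative comment.

```python
# Sketch of knowledge-via-API: fetch facts externally, then let the local
# model only phrase the answer.
import requests

def wikipedia_summary(topic: str) -> str:
    """Fetch a short factual summary for `topic` from Wikipedia."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{topic.replace(' ', '_')}"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json().get("extract", "")

facts = wikipedia_summary("Alan Turing")
# The local model then only rephrases `facts` for the user, e.g. with a
# prompt such as: f"Context: {facts}\nQuestion: Who was Alan Turing?\nAnswer:"
print(facts)
```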

An example of this approach is processing long articles or PDF documents. With the ChatGPT (GPT-3.5) API, which processes one limited context at a time, the text must be split into segments and matched against the user's input, which is inefficient. Our minimal multilingual model can instead analyze the text sentence by sentence, generating several human-readable questions for each sentence. It can then link these questions into a question-answer tree and apply algorithms such as PageRank, so that the answers presented to the user rest on a preliminary logical analysis.
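A minimal sketch of the question-graph idea, with the assumptions labeled: `generate_questions` is a hypothetical stand-in for a call to the local model, questions are linked by shared vocabulary as a crude proxy for the logical connections described above, and networkx supplies PageRank.

```python
# Sketch: build a graph of per-sentence questions and rank them with PageRank.
import itertools
import networkx as nx

def generate_questions(sentence: str) -> list[str]:
    # Hypothetical stand-in for a call to the local model; a real
    # implementation would generate several questions per sentence.
    return [f"What does this statement mean: {sentence!r}?"]

def build_question_graph(sentences: list[str]) -> nx.DiGraph:
    graph = nx.DiGraph()
    for idx, sentence in enumerate(sentences):
        for question in generate_questions(sentence):
            graph.add_node(question, sentence=sentence, position=idx)
    # Link questions sharing vocabulary: a crude proxy for logical relatedness.
    for a, b in itertools.combinations(list(graph.nodes), 2):
        if set(a.lower().split()) & set(b.lower().split()):
            graph.add_edge(a, b)
            graph.add_edge(b, a)
    return graph

def top_questions(graph: nx.DiGraph, k: int = 5) -> list[str]:
    scores = nx.pagerank(graph)  # central questions score highest
    return sorted(scores, key=scores.get, reverse=True)[:k]

graph = build_question_graph(["Guanaco is a 4-bit quantized model.",
                              "It runs in under 6 GB of memory."])
print(top_questions(graph))
```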

Furthermore, our model can be applied to summarizing web search results. Such use cases are challenging for large models because of their cost, scale, and rate limits, but are feasible on local, small-scale, consumer-level hardware. This direction represents the next step in our work.
