Metadata-Version: 2.1
Name: ctransformers
Version: 0.2.11
Summary: Python bindings for the Transformer models implemented in C/C++ using GGML library.
Home-page: https://github.com/marella/ctransformers
Author: Ravindra Marella
Author-email: mv.ravindra007@gmail.com
License: MIT
Keywords: ctransformers transformers ai llm
Classifier: Development Status :: 1 - Planning
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Mathematics
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Description-Content-Type: text/markdown
Provides-Extra: tests
License-File: LICENSE

# [C Transformers](https://github.com/marella/ctransformers) [![PyPI](https://img.shields.io/pypi/v/ctransformers)](https://pypi.org/project/ctransformers/) [![tests](https://github.com/marella/ctransformers/actions/workflows/tests.yml/badge.svg)](https://github.com/marella/ctransformers/actions/workflows/tests.yml) [![build](https://github.com/marella/ctransformers/actions/workflows/build.yml/badge.svg)](https://github.com/marella/ctransformers/actions/workflows/build.yml)

Python bindings for the Transformer models implemented in C/C++ using the [GGML](https://github.com/ggerganov/ggml) library.

> Also see [ChatDocs](https://github.com/marella/chatdocs)

- [Supported Models](#supported-models)
- [Installation](#installation)
- [Usage](#usage)
  - [Hugging Face Hub](#hugging-face-hub)
  - [LangChain](#langchain)
  - [GPU](#gpu)
- [Documentation](#documentation)
- [License](#license)

## Supported Models

| Models                | Model Type  |
| :-------------------- | ----------- |
| GPT-2                 | `gpt2`      |
| GPT-J, GPT4All-J      | `gptj`      |
| GPT-NeoX, StableLM    | `gpt_neox`  |
| LLaMA                 | `llama`     |
| MPT                   | `mpt`       |
| Dolly V2              | `dolly-v2`  |
| StarCoder, StarChat   | `starcoder` |
| Falcon (Experimental) | `falcon`    |

## Installation

```sh
pip install ctransformers
```

For GPU (CUDA) support, set the environment variable `CT_CUBLAS=1` and install from source using:

```sh
CT_CUBLAS=1 pip install ctransformers --no-binary ctransformers
```
**Show commands for Windows**

On Windows PowerShell run:

```sh
$env:CT_CUBLAS=1
pip install ctransformers --no-binary ctransformers
```

On Windows Command Prompt run:

```sh
set CT_CUBLAS=1
pip install ctransformers --no-binary ctransformers
```
## Usage

It provides a unified interface for all models:

```py
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained('/path/to/ggml-gpt-2.bin', model_type='gpt2')

print(llm('AI is going to'))
```

[Run in Google Colab](https://colab.research.google.com/drive/1GMhYMUAv_TyZkpfvUI1NirM8-9mCXQyL)

If you are getting an `illegal instruction` error, try using `lib='avx'` or `lib='basic'`:

```py
llm = AutoModelForCausalLM.from_pretrained('/path/to/ggml-gpt-2.bin', model_type='gpt2', lib='avx')
```

It provides a generator interface for more control:

```py
tokens = llm.tokenize('AI is going to')

for token in llm.generate(tokens):
    print(llm.detokenize(token))
```

It can be used with a custom or Hugging Face tokenizer:

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')

tokens = tokenizer.encode('AI is going to')

for token in llm.generate(tokens):
    print(tokenizer.decode(token))
```

It also provides access to the low-level C API. See the [Documentation](#documentation) section below.
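Generated text can also be streamed as it is produced by passing `stream=True`, in which case the call returns a generator of text instead of a single string (see the `stream` parameter in the [Documentation](#documentation) section). A minimal sketch, reusing the `llm` object from the examples above:

```py
# With stream=True the call yields text piece by piece instead of returning the full result.
for text in llm('AI is going to', stream=True):
    print(text, end='', flush=True)
```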
### Hugging Face Hub

It can be used with models hosted on the Hub:

```py
llm = AutoModelForCausalLM.from_pretrained('marella/gpt-2-ggml')
```

If a model repo has multiple model files (`.bin` files), specify a model file using:

```py
llm = AutoModelForCausalLM.from_pretrained('marella/gpt-2-ggml', model_file='ggml-model.bin')
```

It can also be used with your own models uploaded to the Hub. For a better user experience, upload only one model per repo.

To use it with your own model, add a `config.json` file to your model repo specifying the `model_type`:

```json
{
  "model_type": "gpt2"
}
```

You can also specify additional parameters under `task_specific_params.text-generation`.

See [marella/gpt-2-ggml](https://huggingface.co/marella/gpt-2-ggml/blob/main/config.json) for a minimal example and [marella/gpt-2-ggml-example](https://huggingface.co/marella/gpt-2-ggml-example/blob/main/config.json) for a full example.

### LangChain

It is integrated into LangChain. See [LangChain docs](https://python.langchain.com/docs/ecosystem/integrations/ctransformers).

### GPU

> **Note:** Currently only LLaMA models have GPU support.

To run some of the model layers on GPU, set the `gpu_layers` parameter:

```py
llm = AutoModelForCausalLM.from_pretrained('/path/to/ggml-llama.bin', model_type='llama', gpu_layers=50)
```

[Run in Google Colab](https://colab.research.google.com/drive/1Ihn7iPCYiqlTotpkqa1tOhUIpJBrJ1Tp)

## Documentation

### Config

| Parameter            | Type        | Description                                               | Default |
| :------------------- | :---------- | :-------------------------------------------------------- | :------ |
| `top_k`              | `int`       | The top-k value to use for sampling.                      | `40`    |
| `top_p`              | `float`     | The top-p value to use for sampling.                      | `0.95`  |
| `temperature`        | `float`     | The temperature to use for sampling.                      | `0.8`   |
| `repetition_penalty` | `float`     | The repetition penalty to use for sampling.               | `1.1`   |
| `last_n_tokens`      | `int`       | The number of last tokens to use for repetition penalty.  | `64`    |
| `seed`               | `int`       | The seed value to use for sampling tokens.                | `-1`    |
| `max_new_tokens`     | `int`       | The maximum number of new tokens to generate.             | `256`   |
| `stop`               | `List[str]` | A list of sequences to stop generation when encountered.  | `None`  |
| `stream`             | `bool`      | Whether to stream the generated text.                     | `False` |
| `reset`              | `bool`      | Whether to reset the model state before generating text.  | `True`  |
| `batch_size`         | `int`       | The batch size to use for evaluating tokens.              | `8`     |
| `threads`            | `int`       | The number of threads to use for evaluating tokens.       | `-1`    |
| `context_length`     | `int`       | The maximum context length to use.                        | `-1`    |
| `gpu_layers`         | `int`       | The number of layers to run on GPU.                       | `0`     |

> **Note:** Currently only LLaMA and MPT models support the `context_length` parameter, and only LLaMA models support the `gpu_layers` parameter.

### class `AutoModelForCausalLM`

---

#### classmethod `AutoModelForCausalLM.from_pretrained`

```python
from_pretrained(
    model_path_or_repo_id: str,
    model_type: Optional[str] = None,
    model_file: Optional[str] = None,
    config: Optional[ctransformers.hub.AutoConfig] = None,
    lib: Optional[str] = None,
    local_files_only: bool = False,
    **kwargs
) → LLM
```

Loads the language model from a local file or remote repo.

**Args:**

- `model_path_or_repo_id`: The path to a model file or directory or the name of a Hugging Face Hub model repo.
- `model_type`: The model type.
- `model_file`: The name of the model file in repo or directory.
- `config`: `AutoConfig` object.
- `lib`: The path to a shared library or one of `avx2`, `avx`, `basic`.
- `local_files_only`: Whether or not to only look at local files (i.e., do not try to download the model).

**Returns:** `LLM` object.

### class `LLM`

#### method `LLM.__init__`

```python
__init__(
    model_path: str,
    model_type: str,
    config: Optional[ctransformers.llm.Config] = None,
    lib: Optional[str] = None
)
```

Loads the language model from a local file.

**Args:**

- `model_path`: The path to a model file.
- `model_type`: The model type.
- `config`: `Config` object.
- `lib`: The path to a shared library or one of `avx2`, `avx`, `basic`.

---

##### property LLM.config

The config object.

---

##### property LLM.context_length

The context length of the model.

---

##### property LLM.embeddings

The input embeddings.

---

##### property LLM.eos_token_id

The end-of-sequence token.

---

##### property LLM.logits

The unnormalized log probabilities.

---

##### property LLM.model_path

The path to the model file.

---

##### property LLM.model_type

The model type.

---

##### property LLM.vocab_size

The number of tokens in the vocabulary.

---

#### method `LLM.detokenize`

```python
detokenize(tokens: Sequence[int], decode: bool = True) → Union[str, bytes]
```

Converts a list of tokens to text.

**Args:**

- `tokens`: The list of tokens.
- `decode`: Whether to decode the text as a UTF-8 string.

**Returns:** The combined text of all tokens.

---

#### method `LLM.embed`

```python
embed(
    input: Union[str, Sequence[int]],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → List[float]
```

Computes embeddings for a text or list of tokens.

> **Note:** Currently only LLaMA models support embeddings.

**Args:**

- `input`: The input text or list of tokens to get embeddings for.
- `batch_size`: The batch size to use for evaluating tokens. Default: `8`
- `threads`: The number of threads to use for evaluating tokens. Default: `-1`

**Returns:** The input embeddings.
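For example, assuming a local LLaMA model file at an illustrative path, a minimal sketch:

```py
from ctransformers import AutoModelForCausalLM

# Illustrative path; only LLaMA models currently support embeddings.
llm = AutoModelForCausalLM.from_pretrained('/path/to/ggml-llama.bin', model_type='llama')

embeddings = llm.embed('AI is going to')  # returns List[float]
print(len(embeddings))
```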
---

#### method `LLM.eval`

```python
eval(
    tokens: Sequence[int],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → None
```

Evaluates a list of tokens.

**Args:**

- `tokens`: The list of tokens to evaluate.
- `batch_size`: The batch size to use for evaluating tokens. Default: `8`
- `threads`: The number of threads to use for evaluating tokens. Default: `-1`

---

#### method `LLM.generate`

```python
generate(
    tokens: Sequence[int],
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    batch_size: Optional[int] = None,
    threads: Optional[int] = None,
    reset: Optional[bool] = None
) → Generator[int, NoneType, NoneType]
```

Generates new tokens from a list of tokens.

**Args:**

- `tokens`: The list of tokens to generate tokens from.
- `top_k`: The top-k value to use for sampling. Default: `40`
- `top_p`: The top-p value to use for sampling. Default: `0.95`
- `temperature`: The temperature to use for sampling. Default: `0.8`
- `repetition_penalty`: The repetition penalty to use for sampling. Default: `1.1`
- `last_n_tokens`: The number of last tokens to use for repetition penalty. Default: `64`
- `seed`: The seed value to use for sampling tokens. Default: `-1`
- `batch_size`: The batch size to use for evaluating tokens. Default: `8`
- `threads`: The number of threads to use for evaluating tokens. Default: `-1`
- `reset`: Whether to reset the model state before generating text. Default: `True`

**Returns:** The generated tokens.

---

#### method `LLM.is_eos_token`

```python
is_eos_token(token: int) → bool
```

Checks if a token is an end-of-sequence token.

**Args:**

- `token`: The token to check.

**Returns:** `True` if the token is an end-of-sequence token, else `False`.

---

#### method `LLM.reset`

```python
reset() → None
```

Resets the model state.

---

#### method `LLM.sample`

```python
sample(
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None
) → int
```

Samples a token from the model.

**Args:**

- `top_k`: The top-k value to use for sampling. Default: `40`
- `top_p`: The top-p value to use for sampling. Default: `0.95`
- `temperature`: The temperature to use for sampling. Default: `0.8`
- `repetition_penalty`: The repetition penalty to use for sampling. Default: `1.1`
- `last_n_tokens`: The number of last tokens to use for repetition penalty. Default: `64`
- `seed`: The seed value to use for sampling tokens. Default: `-1`

**Returns:** The sampled token.

---

#### method `LLM.tokenize`

```python
tokenize(text: str) → List[int]
```

Converts a text into a list of tokens.

**Args:**

- `text`: The text to tokenize.

**Returns:** The list of tokens.
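To illustrate how these low-level methods fit together, here is a hand-rolled generation loop. This is only a sketch; `generate()` and `__call__()` (below) already implement this logic:

```py
# Reusing the `llm` object from the Usage examples above.
llm.reset()
tokens = llm.tokenize('AI is going to')
llm.eval(tokens)

generated = []
for _ in range(32):  # cap the number of new tokens for this sketch
    token = llm.sample()
    if llm.is_eos_token(token):
        break
    generated.append(token)
    llm.eval([token])  # feed the sampled token back into the model

print(llm.detokenize(generated))
```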
---

#### method `LLM.__call__`

```python
__call__(
    prompt: str,
    max_new_tokens: Optional[int] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    batch_size: Optional[int] = None,
    threads: Optional[int] = None,
    stop: Optional[Sequence[str]] = None,
    stream: Optional[bool] = None,
    reset: Optional[bool] = None
) → Union[str, Generator[str, NoneType, NoneType]]
```

Generates text from a prompt.

**Args:**

- `prompt`: The prompt to generate text from.
- `max_new_tokens`: The maximum number of new tokens to generate. Default: `256`
- `top_k`: The top-k value to use for sampling. Default: `40`
- `top_p`: The top-p value to use for sampling. Default: `0.95`
- `temperature`: The temperature to use for sampling. Default: `0.8`
- `repetition_penalty`: The repetition penalty to use for sampling. Default: `1.1`
- `last_n_tokens`: The number of last tokens to use for repetition penalty. Default: `64`
- `seed`: The seed value to use for sampling tokens. Default: `-1`
- `batch_size`: The batch size to use for evaluating tokens. Default: `8`
- `threads`: The number of threads to use for evaluating tokens. Default: `-1`
- `stop`: A list of sequences to stop generation when encountered. Default: `None`
- `stream`: Whether to stream the generated text. Default: `False`
- `reset`: Whether to reset the model state before generating text. Default: `True`

**Returns:** The generated text.

## License

[MIT](https://github.com/marella/ctransformers/blob/main/LICENSE)