johnrachwanpruna committed
Commit b42492e
1 Parent(s): 438f966

Update README.md

Files changed (1):
  1. README.md +61 -61
README.md CHANGED
@@ -136,67 +136,67 @@ The following clients/libraries will automatically download models for you, prov
 
  You can use GGUF models from Python using the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) or [ctransformers](https://github.com/marella/ctransformers) libraries. Note that at the time of writing (Nov 27th 2023), ctransformers has not been updated for some time and is not compatible with some recent models. Therefore I recommend you use llama-cpp-python.
 
+ ### How to load this model in Python code, using llama-cpp-python
+
+ For full documentation, please see: [llama-cpp-python docs](https://abetlen.github.io/llama-cpp-python/).
+
+ #### First install the package
+
+ Run one of the following commands, according to your system:
+
+ ```shell
+ # Base llama-cpp-python with no GPU acceleration
+ pip install llama-cpp-python
+ # With NVidia CUDA acceleration
+ CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
+ # Or with OpenBLAS acceleration
+ CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
+ # Or with CLBlast acceleration
+ CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
+ # Or with AMD ROCm GPU acceleration (Linux only)
+ CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
+ # Or with Metal GPU acceleration (macOS only)
+ CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
+
+ # On Windows, set the CMAKE_ARGS variable in PowerShell before installing; e.g. for NVidia CUDA:
+ $env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
+ pip install llama-cpp-python
+ ```
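+
+ To quickly sanity-check the install, here is a minimal sketch; it only assumes the llama-cpp-python package installed above, and the backend note refers to llama.cpp's own startup log:
+
+ ```python
+ import llama_cpp
+
+ # Print the installed package version to confirm the wheel built and imports cleanly.
+ print(llama_cpp.__version__)
+
+ # Loading any model with verbose=True (the default) makes llama.cpp print its
+ # system info, which shows whether the acceleration backend you compiled for
+ # (CUDA, Metal, OpenBLAS, ...) is actually active.
+ ```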
+
+ #### Simple llama-cpp-python example code
+
+ ```python
+ from llama_cpp import Llama
+
+ # Set n_gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
+ llm = Llama(
+   model_path="./phi-2.IQ3_M.gguf",  # Download the model file first
+   n_ctx=2048,       # The max sequence length to use - phi-2 was trained with a 2048-token context, and longer values need more resources
+   n_threads=8,      # The number of CPU threads to use, tailor to your system and the resulting performance
+   n_gpu_layers=35   # The number of layers to offload to GPU, if you have GPU acceleration available
+ )
+
+ # Simple inference example
+ output = llm(
+   "Instruct: {prompt}\nOutput:",  # Prompt in phi-2's instruct format - check the model card if unsure
+   max_tokens=512,                 # Generate up to 512 tokens
+   stop=["<|endoftext|>"],         # Example stop token - not necessarily correct for this specific model! Please check before using.
+   echo=True                       # Whether to echo the prompt
+ )
+
+ # Chat Completion API
+
+ llm = Llama(model_path="./phi-2.IQ3_M.gguf", chat_format="llama-2")  # Set chat_format according to the model you are using
+ llm.create_chat_completion(
+     messages = [
+         {"role": "system", "content": "You are a story writing assistant."},
+         {
+             "role": "user",
+             "content": "Write a story about llamas."
+         }
+     ]
+ )
+ ```
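+
+ Both calls return OpenAI-style dictionaries. A minimal sketch of reading the results back, assuming the standard llama-cpp-python return layout (the stream=True variation is optional and shown only for illustration):
+
+ ```python
+ # Completion API: the generated text is under choices[0]["text"].
+ print(output["choices"][0]["text"])
+
+ # Chat Completion API: the reply is under choices[0]["message"]["content"].
+ chat = llm.create_chat_completion(
+     messages=[{"role": "user", "content": "Write a story about llamas."}]
+ )
+ print(chat["choices"][0]["message"]["content"])
+
+ # With stream=True the call yields partial chunks rather than one dict,
+ # which lets you print tokens as they are generated.
+ for chunk in llm("Instruct: Write a haiku about llamas.\nOutput:", max_tokens=64, stream=True):
+     print(chunk["choices"][0]["text"], end="", flush=True)
+ ```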
 
  - **Option D** - Running with LangChain
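+
+ Option D is only named in this hunk; as a minimal sketch, LangChain's LlamaCpp wrapper can point at the same GGUF file. This assumes the late-2023 langchain import path (newer releases expose the class from langchain_community.llms instead), and the parameter values simply mirror the example above:
+
+ ```python
+ from langchain.llms import LlamaCpp
+
+ # Wrap the local GGUF file so it can be used anywhere LangChain expects an LLM.
+ llm = LlamaCpp(
+     model_path="./phi-2.IQ3_M.gguf",  # same local file as in the llama-cpp-python example
+     n_ctx=2048,                       # phi-2's trained context length
+     n_gpu_layers=35,                  # set to 0 on CPU-only systems
+     max_tokens=256,
+ )
+
+ print(llm("Instruct: Write a story about llamas.\nOutput:"))
+ ```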