How to Set Up and Run Ollama on a GPU-Powered VM (vast.ai)

Community Article · Published August 14, 2024

In this tutorial, we'll walk you through setting up and using Ollama for private model inference on a GPU-powered VM, either on your own machine or on a rented VM from Vast.ai or Runpod.io. Ollama allows you to run models privately, keeping your data on hardware you control, and a GPU delivers much faster inference than running on the CPU alone. By leveraging a GPU-powered VM, you can significantly improve the performance and efficiency of your model inference tasks.

Outline

  1. Set up a VM with GPU on Vast.ai
  2. Start Jupyter Terminal
  3. Install Ollama
  4. Run Ollama Serve
  5. Test Ollama with a model
  6. (Optional) Use your own model

Setting Up a VM with GPU on Vast.ai

1. Create a VM with GPU:
   - Visit Vast.ai to create your VM (a command-line alternative is sketched below).
   - Choose a VM with at least 30 GB of storage to accommodate the models. This ensures you have enough space for installation and model storage.
   - Select a VM that costs less than $0.30 per hour to keep the setup cost-effective.
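
If you prefer the command line to the web UI, Vast.ai also provides a CLI that can filter offers by price and disk size. The snippet below is only a rough sketch: the package name `vastai`, the query fields (`num_gpus`, `disk_space`, `dph_total`), and the `-o` ordering flag are assumptions based on my reading of the Vast.ai CLI docs, so double-check them with `vastai search offers --help`.

```bash
# Install the Vast.ai CLI and authenticate (the API key is in your account settings)
pip install --upgrade vastai
vastai set api-key YOUR_API_KEY   # placeholder key

# Search for single-GPU offers with >= 30 GB disk under $0.30/hr, cheapest first.
# Field names and units are assumptions -- verify with `vastai search offers --help`.
vastai search offers 'num_gpus=1 disk_space>=30 dph_total<=0.30' -o 'dph_total'
```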

2. Start Jupyter Terminal:
   - Once your VM is up and running, start Jupyter and open a terminal within it. This is the easiest way to get started.
   - Alternatively, you can connect to the VM over SSH from your local machine, for example with VSCode, but you will need to create an SSH key first (see the sketch below).
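
If you go the SSH route, a minimal key setup looks roughly like the sketch below. The key path, port, and IP are placeholders, and Vast.ai expects the public key to be added to your account (or instance) SSH settings before you connect.

```bash
# Generate a new SSH key pair locally (skip if you already have one)
ssh-keygen -t ed25519 -f ~/.ssh/vastai_key -N ""

# Show the public key, then paste it into your Vast.ai SSH key settings
cat ~/.ssh/vastai_key.pub

# Connect using the address and port shown on your instance card (placeholders)
ssh -i ~/.ssh/vastai_key -p <PORT> root@<INSTANCE_IP>
```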

Downloading and Running Ollama

1. Install Ollama: Open the terminal in Jupyter and run the following command to install Ollama:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```
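
To confirm the install worked before going further, check that the `ollama` binary is on your PATH:

```bash
# Should print the installed Ollama version
ollama --version
```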

2. Run Ollama Serve: After installation, start the Ollama service by running:

```bash
ollama serve &
```

Check the startup output for GPU errors. If Ollama cannot use the GPU, it falls back to the CPU and responses will be noticeably slow when you interact with the model.
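
If you want to keep the server output around for troubleshooting, one option (just a sketch, not the only way) is to redirect it to a log file and search it for GPU-related lines; the exact log wording varies between Ollama versions.

```bash
# Start the server in the background and capture its output
nohup ollama serve > ollama.log 2>&1 &

# Look for GPU/CUDA-related lines (exact wording depends on the Ollama version)
grep -iE "gpu|cuda" ollama.log
```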

3. Test Ollama with a Model: Test the setup by running a sample model like Mistral:

```bash
ollama run mistral
```

You can now start chatting with the model to ensure everything is working correctly.
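
Besides the interactive chat, Ollama exposes an HTTP API on port 11434 by default, so you can also test the model with curl:

```bash
# Send a single prompt to the local Ollama API (default port 11434)
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```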

Optional: Check GPU Usage

Check GPU Utilization: While the model is answering (during the previous step), check whether the GPU is being used by running:

```bash
nvidia-smi
```

Make sure the GPU memory usage and utilization are above 0; this indicates that the GPU is handling the inference.
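
For a continuously updating view while the model is answering, you can poll a few fields in a loop (a small convenience sketch; stop it with Ctrl+C):

```bash
# Print GPU utilization and memory usage every 2 seconds
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 2
```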

Using Your Own Hugging Face Model with Ollama

1. Install Hugging Face CLI: If you want to use your own model from Hugging Face, first install the Hugging Face CLI. As an example, we will use a fine-tuned Mistral model, TheBloke/em_german_mistral_v01-GGUF (file: em_german_mistral_v01.Q4_K_M.gguf).

2. Download Your Model: Download your desired model from Hugging Face. For example, to download the fine-tuned Mistral model above:

```bash
pip3 install huggingface-hub

# Try with my custom fine-tuned Mistral model
huggingface-cli download TheBloke/em_german_mistral_v01-GGUF em_german_mistral_v01.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
```
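
The Q4_K_M quantization of a 7B model is several gigabytes, so it is worth confirming the file arrived intact before building the Modelfile:

```bash
# The GGUF file should be present and a few GB in size
ls -lh em_german_mistral_v01.Q4_K_M.gguf
```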

3. Create a Model File: Create a model config file named **Modelfile** with the following content:

```
FROM em_german_mistral_v01.Q4_K_M.gguf

# set the temperature to 0 [higher is more creative, lower is more coherent]
PARAMETER temperature 0

# # set the system message
# SYSTEM """
# You are Mario from Super Mario Bros. Answer as Mario, the assistant, only.
# """
```

4. Instruct Ollama to Create the Model: Create the custom model using Ollama with the command:

```bash
ollama create mymodel -f Modelfile
```
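
You can verify that the model was registered by listing the locally available models:

```bash
# The new "mymodel" entry should appear alongside any pulled models
ollama list
```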

5. Run Your Custom Model: Run your custom model using:

```bash
ollama run mymodel
```
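
As with the stock Mistral model, you can also pass a one-off prompt on the command line instead of using the interactive chat (the German prompt is just an illustrative example for this German-tuned model):

```bash
# One-off prompt without entering the interactive chat
ollama run mymodel "Was ist die Hauptstadt von Deutschland?"
```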

By following these steps, you can effectively use Ollama for private model inference on a GPU-powered VM, ensuring secure and efficient operation for your machine learning projects.

Happy prompting!