elyza
/

Llama-3-ELYZA-JP-8B-GGUF

Inference Endpoints

Model card Files Files and versions Community

tyoyo commited on Jun 25, 2024

Commit

05aa8dc

·

verified ·

1 Parent(s): bad0264

Update README.md

Files changed (1) hide show

README.md +7 -3

README.md CHANGED Viewed

@@ -27,9 +27,9 @@ The following table shows the performance degradation due to quantization:
 | Model | ELYZA-tasks-100 GPT4 score |
 | :-------------------------------- | ---: |
-| Llama-3-ELYZA-JP-8B               | 3.655 |
-| Llama-3-ELYZA-JP-8B-GGUF (Q4_K_M) | 3.57  |
-| Llama-3-ELYZA-JP-8B-AWQ           | 3.39  |
 ## Use with llama.cpp
@@ -90,6 +90,10 @@ There are various desktop applications that can handle GGUF models, but here we
 - **Setting Options**: You can set options from the sidebar on the right. Faster inference can be achieved by setting Quick GPU Offload to Max in the GPU Settings.
 - **(For Developers) Starting an API Server**: Click `<->` in the left sidebar and move to the Local Server tab. Select the model and click Start Server to launch an OpenAI API-compatible API server.
 ## Developers
 Listed in alphabetical order.

 | Model | ELYZA-tasks-100 GPT4 score |
 | :-------------------------------- | ---: |
+| [Llama-3-ELYZA-JP-8B](https://huggingface.co/elyza/Llama-3-ELYZA-JP-8B)               | 3.655 |
+| [Llama-3-ELYZA-JP-8B-GGUF (Q4_K_M)](https://huggingface.co/elyza/Llama-3-ELYZA-JP-8B-GGUF) | 3.57  |
+| [Llama-3-ELYZA-JP-8B-AWQ](https://huggingface.co/elyza/Llama-3-ELYZA-JP-8B-AWQ)           | 3.39  |
 ## Use with llama.cpp
 - **Setting Options**: You can set options from the sidebar on the right. Faster inference can be achieved by setting Quick GPU Offload to Max in the GPU Settings.
 - **(For Developers) Starting an API Server**: Click `<->` in the left sidebar and move to the Local Server tab. Select the model and click Start Server to launch an OpenAI API-compatible API server.
+![lmstudio-demo](./lmstudio-demo.gif)
+This demo showcases Llama-3-ELYZA-JP-8B-GGUF running smoothly on a MacBook Pro (M1 Pro), achieving an inference speed of approximately 20 tokens per second.
 ## Developers
 Listed in alphabetical order.