JRosenkranz committed

Commit ef36b0a · Parent(s): 78a73f1

Update README.md

README.md CHANGED
@@ -33,7 +33,7 @@ Training is light-weight and can be completed in only a few days depending on ba
 
 _Note: For all samples, your environment must have access to cuda_
 
-### Production
+### Use in IBM Production TGIS
 
 *To try this out running in a production-like environment, please use the pre-built docker image:*
 
@@ -101,6 +101,27 @@ python sample_client.py
 
 _Note: first prompt may be slower as there is a slight warmup time_
 
+### Use in Huggingface TGI
+
+#### start the server
+
+```bash
+model=ibm-fms/llama-13b-accelerator
+volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
+docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model
+```
+
+_note: for tensor parallel, add --num-shard_
+
+#### make a request
+
+```bash
+curl 127.0.0.1:8080/generate_stream \
+    -X POST \
+    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
+    -H 'Content-Type: application/json'
+```
+
 ### Minimal Sample
 
 *To try this out with the fms-native compiled model, please execute the following:*
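The `/generate_stream` endpoint used in the `curl` request above returns server-sent events. As a minimal sketch of consuming that stream from Python — assuming the third-party `requests` package and TGI's `data:`-prefixed JSON event lines with the generated text under `token.text` (check the TGI API docs for the exact schema of your server version):

```python
import json

def parse_sse_line(line: str):
    """Extract generated token text from one server-sent-event line.

    Assumed event format: 'data:{"token":{"text":" Deep"}, ...}'.
    Lines without the 'data:' prefix (comments, keep-alives) return None.
    """
    if not line.startswith("data:"):
        return None
    payload = json.loads(line[len("data:"):])
    return payload.get("token", {}).get("text")

# Hypothetical streaming client (requires the TGI server started above):
# import requests
# resp = requests.post(
#     "http://127.0.0.1:8080/generate_stream",
#     json={"inputs": "What is Deep Learning?",
#           "parameters": {"max_new_tokens": 20}},
#     stream=True,
# )
# for raw in resp.iter_lines():
#     text = parse_sse_line(raw.decode("utf-8"))
#     if text is not None:
#         print(text, end="", flush=True)
```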