Merged Model Performance
This repository contains our hallucination evaluation model, with the PEFT adapter merged into the base model weights.
Hallucination Detection Metrics
Our merged model achieves the following performance on a binary classification task for detecting hallucinations in language model outputs:
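| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| 0 | 0.85 | 0.71 | 0.77 | 100 |
| 1 | 0.75 | 0.87 | 0.81 | 100 |
| Accuracy | | | 0.79 | 200 |
| Macro avg | 0.80 | 0.79 | 0.79 | 200 |
| Weighted avg | 0.80 | 0.79 | 0.79 | 200 |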
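As a sanity check on the report above, the sketch below recomputes the per-class metrics from a confusion matrix. The counts are an assumption: they are inferred from the reported recall and support (71 of 100 class-0 examples and 87 of 100 class-1 examples classified correctly) and are not published in this model card. The helper name `precision_recall_f1` is illustrative only.

```python
# Confusion counts inferred (assumption) from the reported recall and support.
tn, fp = 71, 29  # examples whose true class is 0
fn, tp = 13, 87  # examples whose true class is 1


def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for one class treated as the positive class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Class 1 as the positive class.
p1, r1, f1_1 = precision_recall_f1(tp, fp, fn)
# Class 0 as the positive class: the roles of the error counts swap.
p0, r0, f1_0 = precision_recall_f1(tn, fn, fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)
macro_f1 = (f1_0 + f1_1) / 2

print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f}")
```

Under these inferred counts, the recomputed values round to the figures in the report.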
Model Usage
For best results, we recommend starting with the following prompting strategy (and encourage tweaks as you see fit):
import torch
from transformers import AutoTokenizer, pipeline


def format_input(reference, query, response):
    prompt = f"""Your job is to evaluate whether a machine learning model has hallucinated or not.
A hallucination occurs when the response is coherent but factually incorrect or nonsensical
outputs that are not grounded in the provided context.
You are given the following information:
####INFO####
[Knowledge]: {reference}
[User Input]: {query}
[Model Response]: {response}
####END INFO####
Based on the information provided is the model output a hallucination? Respond with only "yes" or "no"
"""
    return prompt


# Pass the grounding context as `reference` and the user question as `query`.
text = format_input(
    reference="Walrus are the largest mammal",
    query="What is the best PC?",
    response="The best PC is the mac",
)
messages = [
    {"role": "user", "content": text}
]

# Assumptions: the merged checkpoint is loaded by its repo id, and "eager"
# attention is used as a portable default (the original snippet left
# `base_model`, `tokenizer`, and `attn_implementation` undefined).
base_model = "grounded-ai/phi3-hallucination-judge-merge"
attn_implementation = "eager"
tokenizer = AutoTokenizer.from_pretrained(base_model)

pipe = pipeline(
    "text-generation",
    model=base_model,
    model_kwargs={"attn_implementation": attn_implementation, "torch_dtype": torch.float16},
    tokenizer=tokenizer,
)
generation_args = {
    "max_new_tokens": 2,
    "return_full_text": False,
    "temperature": 0.01,
    "do_sample": True,
}
output = pipe(messages, **generation_args)
print(f"Hallucination: {output[0]['generated_text'].strip().lower()}")
# Hallucination: yes
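Because the prompt asks the judge to respond with only "yes" or "no", downstream code usually wants a boolean rather than raw text. The following is a minimal sketch of one way to normalize the generated string. The function name `parse_verdict` and the choice to return `None` for unexpected output are assumptions for illustration, not part of this repository.

```python
import string


def parse_verdict(generated_text):
    """Map the judge's raw text to True (hallucination), False (grounded), or None (unclear)."""
    tokens = generated_text.strip().lower().split()
    if not tokens:
        return None
    # Compare only the first word, ignoring surrounding quotes or punctuation.
    first = tokens[0].strip(string.punctuation)
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


for sample in [" Yes", "no.", "", "maybe"]:
    print(repr(sample), "->", parse_verdict(sample))
```

Returning `None` instead of guessing makes it easy to count or log responses that do not follow the requested format.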
Comparison with Other Models
We compared our merged model's performance on the hallucination detection benchmark against several other state-of-the-art language models:
| Model | Precision | Recall | F1 |
|---|---|---|---|
| Our Merged Model | 0.75 | 0.87 | 0.81 |
| GPT-4 | 0.93 | 0.72 | 0.82 |
| GPT-4 Turbo | 0.97 | 0.70 | 0.81 |
| Gemini Pro | 0.89 | 0.53 | 0.67 |
| GPT-3.5 | 0.89 | 0.65 | 0.75 |
| GPT-3.5-turbo-instruct | 0.89 | 0.80 | 0.84 |
| Palm 2 (Text Bison) | 1.00 | 0.44 | 0.61 |
| Claude V2 | 0.80 | 0.95 | 0.87 |
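As shown in the table, our merged model reaches an F1 score of 0.81. This matches GPT-4 Turbo (0.81) and exceeds GPT-3.5 (0.75), Gemini Pro (0.67), and Palm 2 (0.61), while trailing GPT-4 (0.82), GPT-3.5-turbo-instruct (0.84), and Claude V2 (0.87). Its recall of 0.87 is second only to Claude V2 (0.95), at the cost of the lowest precision in the table (0.75).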
We will continue to improve and fine-tune our merged model to achieve even better performance across various benchmarks and tasks.
Citation: comparison scores are from arize/phoenix.
Training Data
@misc{HaluEval,
  author = {Junyi Li and Xiaoxue Cheng and Wayne Xin Zhao and Jian-Yun Nie and Ji-Rong Wen},
  title = {HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models},
  year = {2023},
  journal = {arXiv preprint arXiv:2305.11747},
  url = {https://arxiv.org/abs/2305.11747}
}
Framework versions
- PEFT 0.11.1
- Transformers 4.41.2
- Pytorch 2.3.0+cu121
- Datasets 2.19.2
- Tokenizers 0.19.1
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 2
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 10
- training_steps: 150
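To make the hyperparameters above concrete, here is a small sketch of how they relate. It assumes a single training device (consistent with 2 × 4 = 8) and assumes the linear scheduler has the usual shape of a linear warm-up followed by a linear decay to zero, as in Transformers' `get_linear_schedule_with_warmup`. The function names are illustrative and not taken from the training code.

```python
LEARNING_RATE = 1e-4
TRAIN_BATCH_SIZE = 2
GRAD_ACCUM_STEPS = 4
WARMUP_STEPS = 10
TRAINING_STEPS = 150


def effective_batch_size(per_device, accum_steps, num_devices=1):
    """Examples contributing to each optimizer step (num_devices=1 is an assumption)."""
    return per_device * accum_steps * num_devices


def linear_lr(step):
    """Learning rate at a given optimizer step: linear warm-up, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, TRAINING_STEPS - step)
    return LEARNING_RATE * remaining / (TRAINING_STEPS - WARMUP_STEPS)


print("total_train_batch_size:", effective_batch_size(TRAIN_BATCH_SIZE, GRAD_ACCUM_STEPS))
for step in (0, 5, 10, 80, 150):
    print(f"step {step:3d}: lr = {linear_lr(step):.2e}")
```

With 150 optimizer steps and 8 examples per step, training under these assumptions processes 1,200 examples in total.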