Hallucination

#1 · opened by cnmoro

Hey!
I've been messing around with smaller models for a while, and I've also tried fine-tuning some of the models you've posted for specific tasks.
Have you found a real application for them yet?
I've noticed that any model with fewer than 1B parameters tends to hallucinate a lot.
Just curious :)

@cnmoro Small models are fun to play with for expanding storylines, completing drafts, and/or roleplay. Despite its size, Qwen 0.5B is a beast at function calling: it reached a 77% pass rate on a function-calling eval dataset.
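For anyone curious, here's a minimal sketch of prompt-based function calling with a small chat model through Transformers.js, so it runs right in the browser. The model id (Xenova/Qwen1.5-0.5B-Chat) and the weather tool are assumptions for illustration only; this is not the setup behind the eval number above.

```typescript
import { pipeline } from "@xenova/transformers";

// Assumed model id; any small chat model with a chat template should work.
const generator = await pipeline("text-generation", "Xenova/Qwen1.5-0.5B-Chat");

// Describe a hypothetical tool in the system prompt and ask for a JSON call.
const messages = [
  {
    role: "system",
    content:
      'You can call this tool: {"name": "get_weather", "parameters": {"city": "string"}}. ' +
      'Reply only with JSON: {"name": "...", "arguments": {...}}.',
  },
  { role: "user", content: "What is the weather in Lisbon?" },
];

const prompt = generator.tokenizer.apply_chat_template(messages, {
  tokenize: false,
  add_generation_prompt: true,
}) as string;

const output = await generator(prompt, { max_new_tokens: 64, do_sample: false });

// generated_text contains the prompt plus the completion; keep the completion
// and parse it as the call, e.g. {"name":"get_weather","arguments":{"city":"Lisbon"}}.
console.log(output[0].generated_text.slice(prompt.length));
```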

Hey, @cnmoro! I'm also following your projects!

As @aloobun mentioned, those small models are great for storytelling/RP. I also find them good at zero-shot Q&A over specific topics.

But during chats, I also see all of them hallucinating a lot. Still, I did put some of them in production, with a different purpose:

In MiniSearch, when the browser doesn't support WebGPU, it falls back to these small models via Transformers.js.
Currently, it's running these models: onnx-Pythia-31M-Chat-v1, onnx-Smol-Llama-101M-Chat-v1, and onnx-Llama-160M-Chat-v1. [Reference]
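Loading one of them with Transformers.js v2 takes only a few lines. Here's a sketch (assuming the models listed above live under the Felladrin namespace on the Hub):

```typescript
import { pipeline } from "@xenova/transformers";

// Assumed repo id for one of the models listed above.
const generator = await pipeline(
  "text-generation",
  "Felladrin/onnx-Llama-160M-Chat-v1"
);

const output = await generator("Once upon a time,", {
  max_new_tokens: 128,
  do_sample: true,
  temperature: 0.7,
});

console.log(output);
```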

I opted for those because, with Transformers.js v2, in-browser inference slows down significantly on models larger than that. [More info here].

It's also worth noting that smartphone web browsers can handle 30M ONNX models without requiring quantization. This matters because quantizing small models can change their output significantly: an unquantized 30M model can perform better than a quantized 100M model, for example.
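In code, that choice is just a loading option: Transformers.js v2 fetches the quantized ONNX weights by default, so full precision has to be requested explicitly. A sketch, with the repo id assumed as above:

```typescript
import { pipeline } from "@xenova/transformers";

// `quantized` defaults to true in Transformers.js v2;
// opting out loads the full-precision ONNX export instead.
const generator = await pipeline(
  "text-generation",
  "Felladrin/onnx-Pythia-31M-Chat-v1",
  { quantized: false }
);
```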

And although it's possible to convert them to GGUF and run llama.cpp compiled to WASM (via LLM.js or llama-cpp-wasm), ONNX Runtime remains the fastest option for inference on mobile devices.

I believe the combination of different styles from the dataset mix in this model has increased the chance of hallucination. To verify this, I fine-tuned Minueza-32M-Base on a single large dataset, and the result was less prone to hallucination. This new model, Felladrin/Minueza-32M-UltraChat, can be tested in the Models Playground.

By the way, I forgot to mention earlier that there's another way to further reduce hallucinations: Contrastive Search. At each step, it re-ranks the most probable next tokens, penalizing those that are too similar to what has already been generated. This strategy is great for question-and-answer or instruction-response scenarios, but it may have a negative impact on multi-turn chats.
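To make the idea concrete, here's a minimal sketch of the contrastive search scoring rule itself (no particular library's API; the types and the alpha value are illustrative): among the top-k candidate tokens, it picks the one that balances the model's confidence against similarity to the context generated so far.

```typescript
// One decoding step of contrastive search:
// score(v) = (1 - alpha) * p(v | x) - alpha * max_j sim(h_v, h_j)

type Candidate = {
  token: string;
  probability: number; // model confidence p(v | x)
  hidden: number[];    // candidate token's hidden representation
};

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// alpha = 0 reduces to greedy decoding; larger alpha penalizes repetition harder.
function pickNextToken(
  topK: Candidate[],         // the k most probable next tokens
  contextHidden: number[][], // hidden states of the tokens generated so far
  alpha = 0.6
): Candidate {
  let best = topK[0];
  let bestScore = -Infinity;
  for (const c of topK) {
    // Degeneration penalty: similarity to the closest previously generated token.
    const maxSim = Math.max(...contextHidden.map((h) => cosineSimilarity(c.hidden, h)));
    const score = (1 - alpha) * c.probability - alpha * maxSim;
    if (score > bestScore) {
      bestScore = score;
      best = c;
    }
  }
  return best;
}
```

That penalty against repeating earlier context may be exactly why it shines in single-turn Q&A but can hurt multi-turn chats, where echoing earlier wording is sometimes desirable.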
