At Rapidata, we compared DeepL with LLMs like DeepSeek-R1, Llama, and Mixtral for translation quality using feedback from over 51,000 native speakers. Despite its cost, DeepL's performance makes it a valuable investment, especially in critical applications where translation quality is paramount. Now we can say that Europe does more than impose regulations.
Our dataset, based on these comparisons, is now available on Hugging Face. This might be useful for anyone working on AI translation or language model evaluation.
First Benchmark of @OpenAI's 4o Image Generation Model!
We've just completed the first-ever (to our knowledge) benchmarking of the new OpenAI 4o image generation model, and the results are impressive!
In our tests, OpenAI 4o image generation absolutely crushed leading competitors, including @black-forest-labs, @google, @xai-org, Ideogram, Recraft, and @deepseek-ai, in prompt alignment and coherence! It leads the nearest competitor by more than 20% in Bradley-Terry score, the biggest gap we have seen since the beginning of the benchmark!
The benchmarks are based on 200k human responses collected through our API. However, the most challenging part wasn't the benchmarking itself, but generating and downloading the images:
- 5 hours to generate 1000 images (no API available yet)
- Just 10 minutes to set up and launch the benchmark
- Over 200,000 responses rapidly collected
While generating the images, we hit some hurdles that forced us to leave out certain parts of our prompt set. In particular, we observed that the OpenAI 4o model proactively refused to generate certain images.
Overall, OpenAI 4o stands out significantly in alignment and coherence, and it excels on certain unusual prompts that have historically caused issues, such as 'A chair on a cat.' See the images for more examples!
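For readers unfamiliar with the Bradley-Terry score mentioned above, here is a minimal sketch of how such scores can be estimated from pairwise human votes. It is not our production pipeline: the function name `bradley_terry`, the model labels, and the vote counts are all made up for illustration, and the fitting loop is the standard iterative (minorization-maximization) update for the Bradley-Terry model.

```python
from collections import defaultdict


def bradley_terry(comparisons, iterations=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Hypothetical helper for illustration. Returns a dict mapping each
    model to a strength, normalized so that the strengths sum to 1.
    """
    wins = defaultdict(int)         # total wins per model
    pair_counts = defaultdict(int)  # number of comparisons per unordered pair
    models = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strengths = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for other in models:
                if other == m:
                    continue
                n = pair_counts.get(frozenset((m, other)), 0)
                if n:  # skip pairs that were never compared
                    denom += n / (strengths[m] + strengths[other])
            updated[m] = wins[m] / denom if denom > 0 else strengths[m]
        total = sum(updated.values())
        updated = {m: s / total for m, s in updated.items()}
        delta = max(abs(updated[m] - strengths[m]) for m in models)
        strengths = updated
        if delta < tol:
            break
    return strengths


# Made-up votes: each tuple is (preferred model, other model).
toy_votes = (
    [("A", "B")] * 8 + [("B", "A")] * 2
    + [("A", "C")] * 9 + [("C", "A")] * 1
    + [("B", "C")] * 6 + [("C", "B")] * 4
)

scores = bradley_terry(toy_votes)
for model, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.3f}")
```

Under this model, the probability that model i is preferred over model j is s_i / (s_i + s_j), so a large gap in scores corresponds to one model winning most of its head-to-head comparisons.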
reacted to nyuuzyou's post 23 days ago
MLLM
> R1 Omni by Alibaba Tongyi - 0.5B
> Qwen2.5 Omni by Alibaba Qwen - 7B with Apache2.0

Video
> CogView-4 by ZhipuAI - Apache2.0
> HunyuanVideo-I2V by TencentHunyuan
> Open Sora2.0 - 11B with Apache2.0
> Stepvideo TI2V by StepFun AI - 30B with MIT license

Image/3D
> Hunyuan3D 2mv/2mini (0.6B) by @TencentHunyuan
> FlexWorld by ByteDance - MIT license
> Qwen2.5-VL-32B-Instruct by Alibaba Qwen - Apache2.0
> Tripo SG (1.5B)/SF by VastAIResearch - MIT license
> InfiniteYou by ByteDance
> LHM by Alibaba AIGC team - Apache2.0
> Spatial LM by ManyCore

Reasoning
> QwQ-32B by Alibaba Qwen - Apache2.0
> Skywork R1V - 38B with MIT license
> RWKV G1 by RWKV AI - 0.1B pure RNN reasoning model with Apache2.0
> Fin R1 by SUFE AIFLM Lab - financial reasoning

LLM
> DeepSeek v3 0324 by DeepSeek - MIT license
> Babel by Alibaba DAMO - 9B/83B/25 languages
reacted to jasoncorkill's post 23 days ago
Yesterday we published the first large evaluation of the new model, showing that it absolutely leaves the competition in the dust. We have now made the results and data available here! Please check it out and leave a like!