DilatoMX/G11n_GenAI_Assesment_Model

Comparative es-MX Performance Evaluation

#4

by AndresCas - opened May 26

Dilato Infotech Limited org May 26

Evaluated Models

ChatGPT (GPT-4o → GPT-5.5)
DeepSeek V3
Gemini 2.5 Flash → Gemini 3 Flash
Copilot (GPT-4o-based → GPT-5.5-based)

📌 Overview

This benchmark evaluates localization performance across major GenAI providers using the GenAI G11n Assessment Model for es-MX scenarios.

The evaluation focuses on four core categories:

Language & Grammar
Instruction & Response Coherence
Cultural Adaptation
Multimodal Consistency

⚠️ Important Context

Between benchmark versions:

prompts evolved,
localization updates were introduced,
prompt complexity increased,
and the AI models themselves received internal updates.

Because of this:

Score differences should be interpreted as performance indicators rather than definitive proof of quality improvement or regression.

📊 Overall Average Scores

Model	v1.0.2	v1.2	Difference
ChatGPT	94.58	96.36	🟢 +1.78
DeepSeek	96.16	96.02	⚪ -0.14
Gemini	97.65	93.82	🔴 -3.83
Copilot	97.53	93.16	🔴 -4.37

Key Findings

ChatGPT was the only model whose overall average increased (+1.78).
DeepSeek remained the most stable model across iterations (-0.14).
Gemini (-3.83) and Copilot (-4.37) experienced the largest overall decreases.
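
The overall averages in the table are consistent with an unweighted mean of the four category scores reported in the sections below. The following is a minimal sketch that reproduces them under that assumption; the post does not state the aggregation formula, and the names `CATEGORY_SCORES`, `overall_average`, and `summarize` are illustrative only. Rounding half-up to two decimals is also an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP

# Category scores copied from the tables in this post, as (v1.0.2, v1.2).
# Order per model: Language & Grammar, Instruction & Response Coherence,
# Cultural Adaptation, Multimodal Consistency.
CATEGORY_SCORES = {
    "ChatGPT": [("94.72", "97.41"), ("94.41", "91.96"),
                ("95.86", "97.93"), ("93.33", "98.15")],
    "DeepSeek": [("96.67", "94.81"), ("95.00", "92.75"),
                 ("94.83", "100.00"), ("98.15", "96.51")],
    "Gemini": [("97.50", "95.93"), ("95.59", "88.04"),
               ("98.62", "97.24"), ("98.89", "94.07")],
    "Copilot": [("98.33", "94.63"), ("97.94", "89.61"),
                ("94.83", "99.31"), ("99.03", "89.07")],
}


def overall_average(scores):
    """Unweighted mean of the scores, rounded half-up to two decimals (assumed)."""
    total = sum(Decimal(s) for s in scores)
    mean = total / len(scores)
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def summarize(model):
    """Return (v1.0.2 average, v1.2 average, difference) for one model."""
    pairs = CATEGORY_SCORES[model]
    old = overall_average([old_score for old_score, _ in pairs])
    new = overall_average([new_score for _, new_score in pairs])
    return old, new, new - old


if __name__ == "__main__":
    for model in CATEGORY_SCORES:
        old, new, diff = summarize(model)
        print(f"{model}\t{old}\t{new}\t{diff:+}")
```

Decimal arithmetic is used here because Copilot's v1.2 mean falls exactly on a rounding boundary (93.155), where binary floating point could round either way.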

📝 Language & Grammar

Model	v1.0.2	v1.2
ChatGPT	94.72	97.41
DeepSeek	96.67	94.81
Gemini	97.50	95.93
Copilot	98.33	94.63

Key Findings

ChatGPT was the only model to improve in this category (+2.69).
DeepSeek decreased moderately (-1.86).
Gemini showed the smallest decrease (-1.57) and kept the second-highest v1.2 score.
Copilot experienced the largest grammar-related decrease (-3.70).

🧠 Instruction & Response Coherence

Model	v1.0.2	v1.2
ChatGPT	94.41	91.96
DeepSeek	95.00	92.75
Gemini	95.59	88.04
Copilot	97.94	89.61

Key Findings

This category showed the largest regression overall: all four models scored lower in v1.2.
Gemini (-7.55) and Copilot (-8.33) were the most affected.
ChatGPT (-2.45) and DeepSeek (-2.25) experienced moderate decreases.
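
One way to check which category regressed most is to average the v1.2 minus v1.0.2 change across the four models. The sketch below does this with the scores from this post; treating every model with equal weight is an assumption, and `SCORES` and `mean_change` are illustrative names, not part of the assessment model.

```python
# (v1.0.2, v1.2) scores per category, copied from the tables in this post.
# Order within each list: ChatGPT, DeepSeek, Gemini, Copilot.
SCORES = {
    "Language & Grammar": [
        (94.72, 97.41), (96.67, 94.81), (97.50, 95.93), (98.33, 94.63)],
    "Instruction & Response Coherence": [
        (94.41, 91.96), (95.00, 92.75), (95.59, 88.04), (97.94, 89.61)],
    "Cultural Adaptation": [
        (95.86, 97.93), (94.83, 100.00), (98.62, 97.24), (94.83, 99.31)],
    "Multimodal Consistency": [
        (93.33, 98.15), (98.15, 96.51), (98.89, 94.07), (99.03, 89.07)],
}


def mean_change(pairs):
    """Average of (new - old) over all models, equally weighted (assumed)."""
    return sum(new - old for old, new in pairs) / len(pairs)


# Unrounded mean change per category; rounding happens only when printing.
changes = {category: mean_change(pairs) for category, pairs in SCORES.items()}

# Category with the most negative average change.
worst = min(changes, key=changes.get)

# Categories whose average change is positive.
improved = [category for category, delta in changes.items() if delta > 0]

if __name__ == "__main__":
    for category, delta in changes.items():
        print(f"{category}: {delta:+.1f}")
    print("Largest average regression:", worst)
    print("Categories with a positive average change:", improved)
```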

🌎 Cultural Adaptation

Model	v1.0.2	v1.2
ChatGPT	95.86	97.93
DeepSeek	94.83	100.00
Gemini	98.62	97.24
Copilot	94.83	99.31

Key Findings

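Strongest-performing category in v1.2, and the only one in which three of the four models improved.
DeepSeek (+5.17) and Copilot (+4.48) showed the largest localization improvements; DeepSeek reached 100.00.
ChatGPT also improved (+2.07).
Gemini remained highly competitive despite a slight decrease (-1.38).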

🖼️ Multimodal Consistency

Model	v1.0.2	v1.2
ChatGPT	93.33	98.15
DeepSeek	98.15	96.51
Gemini	98.89	94.07
Copilot	99.03	89.07

Key Findings

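ChatGPT showed the strongest multimodal improvement (+4.82) and was the only model to improve in this category.
DeepSeek declined slightly (-1.64).
Gemini experienced moderate regression (-4.82).
Copilot showed the largest multimodal decrease (-9.96), which is also the largest single-category drop in the benchmark.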
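
To see every model and category change side by side, the deltas can be listed and searched for extremes. This is a minimal sketch using the scores from this post; `CATEGORIES`, `SCORES`, and `deltas` are illustrative names only.

```python
CATEGORIES = [
    "Language & Grammar",
    "Instruction & Response Coherence",
    "Cultural Adaptation",
    "Multimodal Consistency",
]

# (v1.0.2, v1.2) scores per model, in the same order as CATEGORIES.
# Values are copied from the tables in this post.
SCORES = {
    "ChatGPT": [(94.72, 97.41), (94.41, 91.96), (95.86, 97.93), (93.33, 98.15)],
    "DeepSeek": [(96.67, 94.81), (95.00, 92.75), (94.83, 100.00), (98.15, 96.51)],
    "Gemini": [(97.50, 95.93), (95.59, 88.04), (98.62, 97.24), (98.89, 94.07)],
    "Copilot": [(98.33, 94.63), (97.94, 89.61), (94.83, 99.31), (99.03, 89.07)],
}


def deltas():
    """Return (model, category, v1.2 - v1.0.2) rows, rounded to two decimals."""
    rows = []
    for model, pairs in SCORES.items():
        for category, (old, new) in zip(CATEGORIES, pairs):
            rows.append((model, category, round(new - old, 2)))
    return rows


rows = deltas()
largest_drop = min(rows, key=lambda row: row[2])
largest_gain = max(rows, key=lambda row: row[2])

if __name__ == "__main__":
    for model, category, delta in rows:
        print(f"{model}\t{category}\t{delta:+.2f}")
    print("Largest drop:", largest_drop)
    print("Largest gain:", largest_gain)
```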

🚀 Final Takeaway

The GenAI G11n Assessment Model successfully identified measurable differences in es-MX localization performance across major GenAI providers.

The results suggest that:

localization quality remains highly competitive,
model behavior continues evolving rapidly,
and prompt design likely has a significant effect on evaluation outcomes.

Most importantly:

Modern GenAI evaluation requires controlled benchmarking methodologies to reliably distinguish between prompt effects and underlying model evolution.
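
One possible controlled design, sketched here as an assumption rather than as the method used in this benchmark, is a 2x2 comparison: score both model versions on both prompt sets, then separate the prompt effect from the model effect. The class `FactorialScores`, the function `decompose`, and the numbers in the usage example are hypothetical placeholders, not measurements from this evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactorialScores:
    """Scores for one provider under every (model version, prompt set) pairing."""
    old_model_old_prompts: float
    old_model_new_prompts: float
    new_model_old_prompts: float
    new_model_new_prompts: float


def decompose(s):
    """Split the observed change into prompt, model, and interaction terms."""
    # Effect of switching prompt sets, averaged over both model versions.
    prompt_effect = (
        (s.old_model_new_prompts - s.old_model_old_prompts)
        + (s.new_model_new_prompts - s.new_model_old_prompts)
    ) / 2
    # Effect of switching model versions, averaged over both prompt sets.
    model_effect = (
        (s.new_model_old_prompts - s.old_model_old_prompts)
        + (s.new_model_new_prompts - s.old_model_new_prompts)
    ) / 2
    # How much the prompt effect differs between the two model versions.
    interaction = (
        (s.new_model_new_prompts - s.new_model_old_prompts)
        - (s.old_model_new_prompts - s.old_model_old_prompts)
    )
    return {
        "prompt_effect": prompt_effect,
        "model_effect": model_effect,
        "interaction": interaction,
        # What an uncontrolled version-to-version comparison would report.
        "observed_change": s.new_model_new_prompts - s.old_model_old_prompts,
    }


if __name__ == "__main__":
    # Hypothetical placeholder values, NOT results from this benchmark.
    example = FactorialScores(
        old_model_old_prompts=95.0,
        old_model_new_prompts=93.0,
        new_model_old_prompts=96.0,
        new_model_new_prompts=92.0,
    )
    for name, value in decompose(example).items():
        print(f"{name}: {value:+.1f}")
```

In this design the two averaged main effects always sum to the observed change, so the single difference reported between benchmark versions can be attributed to its sources. The design only works if the earlier model version is still accessible, which may not hold for hosted products.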
