Comparative es-MX Performance Evaluation

#4
by AndresCas - opened
Dilato Infotech Limited org

Evaluated Models

  • ChatGPT (GPT-4o → GPT-5.5)
  • DeepSeek V3
  • Gemini 2.5 Flash → Gemini 3 Flash
  • Copilot (GPT-4o-based → GPT-5.5-based)

📌 Overview

This benchmark evaluates localization performance across major GenAI providers using the GenAI G11n Assessment Model for es-MX scenarios.

The evaluation focuses on four core categories:

  • Language & Grammar
  • Instruction & Response Coherence
  • Cultural Adaptation
  • Multimodal Consistency

⚠️ Important Context

Between benchmark versions:

  • prompts evolved,
  • localization updates were introduced,
  • prompt complexity increased,
  • and the AI models themselves received internal updates.

Because of this:

Score differences should be interpreted as performance indicators rather than definitive proof of quality improvement or regression.


📊 Overall Average Scores

Model v1.0.2 v1.2 Difference
ChatGPT 94.58 96.36 🟢 +1.78
DeepSeek 96.16 96.02 ⚪ -0.14
Gemini 97.65 93.82 🔴 -3.83
Copilot 97.53 93.16 🔴 -4.37

Key Findings

  • ChatGPT was the only model whose overall score improved (+1.78).
  • DeepSeek remained the most stable model across iterations (-0.14).
  • Gemini (-3.83) and Copilot (-4.37) experienced the largest overall decreases.
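The Difference column and the findings above can be reproduced directly from the two score columns. The snippet below is a minimal sketch that uses only the numbers in the table; the names `overall` and `score_changes` are illustrative and not part of the assessment model.

```python
# Overall average scores copied from the table above: (v1.0.2, v1.2).
overall = {
    "ChatGPT": (94.58, 96.36),
    "DeepSeek": (96.16, 96.02),
    "Gemini": (97.65, 93.82),
    "Copilot": (97.53, 93.16),
}


def score_changes(scores):
    """Return {model: v1.2 - v1.0.2}, rounded to two decimals."""
    return {model: round(new - old, 2) for model, (old, new) in scores.items()}


changes = score_changes(overall)

# Rank the models by the signed change and by its absolute size.
most_improved = max(changes, key=changes.get)
most_stable = min(changes, key=lambda model: abs(changes[model]))
largest_drop = min(changes, key=changes.get)

for model, delta in changes.items():
    print(f"{model:<9} {delta:+.2f}")
print("most improved:", most_improved)
print("most stable:  ", most_stable)
print("largest drop: ", largest_drop)
```

Given the confounds listed under Important Context, these deltas describe what changed between the two benchmark runs, not why it changed.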

📝 Language & Grammar

Model v1.0.2 v1.2
ChatGPT 94.72 97.41
DeepSeek 96.67 94.81
Gemini 97.50 95.93
Copilot 98.33 94.63

Key Findings

  • ChatGPT was the only model to improve in this category (+2.69).
  • DeepSeek remained relatively stable (-1.86).
  • Gemini maintained strong grammatical consistency, with the smallest decrease (-1.57).
  • Copilot experienced the largest grammar-related decrease (-3.70).

🧠 Instruction & Response Coherence

Model v1.0.2 v1.2
ChatGPT 94.41 91.96
DeepSeek 95.00 92.75
Gemini 95.59 88.04
Copilot 97.94 89.61

Key Findings

  • This category showed the largest regression overall: all four models scored lower in v1.2.
  • Gemini (-7.55) and Copilot (-8.33) were the most affected.
  • ChatGPT (-2.45) and DeepSeek (-2.25) also experienced moderate decreases.
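One way to check the "largest regression" claim is to average the per-model change within each category. The sketch below uses the scores from the four category tables in this post. Ranking categories by an unweighted mean across models is an assumption made here for illustration, not a rule stated by the assessment model.

```python
# Category scores from the four category tables: {model: (v1.0.2, v1.2)}.
categories = {
    "Language & Grammar": {
        "ChatGPT": (94.72, 97.41), "DeepSeek": (96.67, 94.81),
        "Gemini": (97.50, 95.93), "Copilot": (98.33, 94.63),
    },
    "Instruction & Response Coherence": {
        "ChatGPT": (94.41, 91.96), "DeepSeek": (95.00, 92.75),
        "Gemini": (95.59, 88.04), "Copilot": (97.94, 89.61),
    },
    "Cultural Adaptation": {
        "ChatGPT": (95.86, 97.93), "DeepSeek": (94.83, 100.00),
        "Gemini": (98.62, 97.24), "Copilot": (94.83, 99.31),
    },
    "Multimodal Consistency": {
        "ChatGPT": (93.33, 98.15), "DeepSeek": (98.15, 96.51),
        "Gemini": (98.89, 94.07), "Copilot": (99.03, 89.07),
    },
}


def mean_change(models):
    """Unweighted mean of (v1.2 - v1.0.2) across the models in one category."""
    deltas = [new - old for old, new in models.values()]
    return sum(deltas) / len(deltas)


mean_changes = {name: mean_change(models) for name, models in categories.items()}
largest_regression = min(mean_changes, key=mean_changes.get)

# Print categories from the largest decrease to the largest increase.
for name, delta in sorted(mean_changes.items(), key=lambda item: item[1]):
    print(f"{name:<34} {delta:+.2f}")
print("largest regression:", largest_regression)
```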

🌎 Cultural Adaptation

Model v1.0.2 v1.2
ChatGPT 95.86 97.93
DeepSeek 94.83 100.00
Gemini 98.62 97.24
Copilot 94.83 99.31

Key Findings

  • Strongest-performing category in v1.2, and the only one where most models improved.
  • DeepSeek (+5.17) and Copilot (+4.48) showed major localization improvements.
  • ChatGPT also improved (+2.07).
  • Gemini remained highly competitive despite a slight decrease (-1.38).

🖼️ Multimodal Consistency

Model v1.0.2 v1.2
ChatGPT 93.33 98.15
DeepSeek 98.15 96.51
Gemini 98.89 94.07
Copilot 99.03 89.07

Key Findings

  • ChatGPT showed the strongest multimodal improvement (+4.82) and was the only model to improve.
  • DeepSeek remained relatively stable across iterations (-1.64).
  • Gemini experienced moderate regression (-4.82).
  • Copilot showed the largest multimodal decrease (-9.96).
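The overall averages reported earlier appear consistent with an unweighted mean of the four category scores. The post does not state the aggregation rule, so the sketch below treats equal weighting as an assumption and checks it against the published numbers; all names in it are illustrative.

```python
# Category scores per model, in the order: Language & Grammar,
# Instruction & Response Coherence, Cultural Adaptation, Multimodal Consistency.
category_scores = {
    "ChatGPT": {"v1.0.2": [94.72, 94.41, 95.86, 93.33], "v1.2": [97.41, 91.96, 97.93, 98.15]},
    "DeepSeek": {"v1.0.2": [96.67, 95.00, 94.83, 98.15], "v1.2": [94.81, 92.75, 100.00, 96.51]},
    "Gemini": {"v1.0.2": [97.50, 95.59, 98.62, 98.89], "v1.2": [95.93, 88.04, 97.24, 94.07]},
    "Copilot": {"v1.0.2": [98.33, 97.94, 94.83, 99.03], "v1.2": [94.63, 89.61, 99.31, 89.07]},
}

# Overall averages as reported in the Overall Average Scores table.
reported_overall = {
    "ChatGPT": {"v1.0.2": 94.58, "v1.2": 96.36},
    "DeepSeek": {"v1.0.2": 96.16, "v1.2": 96.02},
    "Gemini": {"v1.0.2": 97.65, "v1.2": 93.82},
    "Copilot": {"v1.0.2": 97.53, "v1.2": 93.16},
}


def mean(values):
    return sum(values) / len(values)


def max_gap(categories, reported):
    """Largest absolute gap between a recomputed mean and the reported overall."""
    return max(
        abs(mean(scores) - reported[model][version])
        for model, versions in categories.items()
        for version, scores in versions.items()
    )


gap = max_gap(category_scores, reported_overall)
# Assumption: equal category weights; a gap below 0.01 is within rounding.
print(f"largest gap vs. reported overall: {gap:.4f}")
print("consistent with an unweighted mean:", gap < 0.01)
```

If the assessment model weights categories differently, only `mean` would need to change.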

🚀 Final Takeaway

The GenAI G11n Assessment Model successfully identified measurable differences in es-MX localization performance across major GenAI providers.

The results suggest that:

  • localization quality remains highly competitive,
  • model behavior continues evolving rapidly,
  • and prompt engineering significantly impacts evaluation outcomes.

Most importantly:

Modern GenAI evaluation requires controlled benchmarking methodologies to reliably distinguish between prompt effects and underlying model evolution.
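One common way to separate the two effects is a 2x2 design: score the old and new prompt sets against both the old and new model versions, then compare the cells. The sketch below is a hypothetical illustration of that arithmetic, not part of the GenAI G11n Assessment Model; `decompose` and its parameter names are invented here. This benchmark measured only two of the four cells (old prompts with old models, new prompts with new models), which is why its differences cannot be attributed to either factor alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decomposition:
    prompt_effect: float  # change from new prompts with the old model held fixed
    model_effect: float   # change from the new model with old prompts held fixed
    interaction: float    # part of the total explained by neither effect alone
    total: float          # new prompts + new model vs. old prompts + old model


def decompose(old_prompts_old_model, new_prompts_old_model,
              old_prompts_new_model, new_prompts_new_model):
    """Split a score change into prompt, model, and interaction parts.

    Each argument is the mean score of one cell in a hypothetical 2x2
    design (prompt version x model version).
    """
    prompt_effect = new_prompts_old_model - old_prompts_old_model
    model_effect = old_prompts_new_model - old_prompts_old_model
    total = new_prompts_new_model - old_prompts_old_model
    interaction = total - prompt_effect - model_effect
    return Decomposition(prompt_effect, model_effect, interaction, total)
```

Filling the two missing cells requires access to a pinned earlier model version, which may not be available for hosted assistants. In that case, re-running the old prompt set on the current models still isolates the prompt effect for the new model.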
