---
license: apache-2.0
datasets:
- cerebras/SlimPajama-627B
- bigcode/starcoderdata
- sam-mosaic/orca-gpt4-chatml
- alvations/globalvoices-en-es
language:
- en
- es
---
This is a finetuned version trained on a partial dataset from alvations/globalvoices-en-es to test performance on a translation task. It has been trained to translate English to Spanish and vice versa with only 20k rows from the dataset.
The translation is not very accurate, but it shows a lot of potential.
To use it you have to follow the ChatML standard, like so:
English to Spanish:
<|im_start|>user Translate this to spanish:
A father and son, who have been living off grid for 20 years, encounter an outsider who threatens to destroy the utopia they've built.<|im_start|>assistant
This will produce the following result:
Un padre y hijo, que han vivido sin comida desde hace 20 años, encuentran un invitado quien amenaza con destruir la utopía que ellos han creado.
Spanish to English:
<|im_start|>user Traduce esto al ingles: ```España se queda sin Copilot para Windows 11: la regulación de la UE frena su despliegue en Europa.```
<|im_start|>assistant
Which will be completed as:
Spain is left without Copilot for Windows 11: the control of the UE has halted its deployment in Europe.
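Putting the two examples together, the following is a minimal sketch of running these prompts with the `transformers` library. The `MODEL_ID` placeholder and the generation parameters are assumptions for illustration, not values taken from this card.

```python
# Minimal sketch, assuming the model is published on the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "MODEL_ID"  # placeholder: replace with this repository's actual id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Build the prompt using the ChatML format shown above.
prompt = (
    "<|im_start|>user Translate this to spanish:\n"
    "A father and son, who have been living off grid for 20 years, "
    "encounter an outsider who threatens to destroy the utopia they've built."
    "<|im_start|>assistant"
)

inputs = tokenizer(prompt, return_tensors="pt")
# max_new_tokens and do_sample are illustrative choices, not tuned values.
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens (the translation).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```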
The results are far from perfect, but there is A LOT of room for improvement since the model was finetuned with only 20k rows from the dataset (which has 355k rows) for 2 epochs. This training took only about 5 hours on an "M1 Pro" processor.
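For reference, a 20k-row subset like the one described above could be taken with the `datasets` library. The split name and the use of the first 20k rows are assumptions, not details from this card.

```python
# Sketch of selecting a 20k-row subset; assumes a "train" split and that
# taking the first 20k rows is acceptable (the actual sampling may differ).
from datasets import load_dataset

subset = load_dataset("alvations/globalvoices-en-es", split="train[:20000]")
print(subset)  # inspect column names before building ChatML prompts from them
```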
The base model is acalatrava/TinyLlama-1.1B-orca-gpt4, a TinyLlama model fine-tuned on an Orca dataset.
## Training
- Method: QLoRA (see the sketch after this list)
- Time: 10h on an M1 Pro 32GB
- Based on: https://colab.research.google.com/drive/1Zmaceu65d7w4Tcd-cfnZRb6k_Tcv2b8g, with quantization removed since it's not supported on MPS
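Since quantization was removed, the setup is effectively plain LoRA. Below is a minimal sketch of such a configuration with the `peft` library; the rank, alpha, dropout, and target modules are illustrative assumptions, not the values used for this model.

```python
# Minimal LoRA setup with peft, mirroring the note above that quantization
# was removed for MPS. Hyperparameters here are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("acalatrava/TinyLlama-1.1B-orca-gpt4")
lora_config = LoraConfig(
    r=8,                      # assumed rank, not the value used for this model
    lora_alpha=16,            # assumed scaling factor
    target_modules=["q_proj", "v_proj"],  # common choice for Llama-style models
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# Training itself would then proceed with e.g. transformers.Trainer on the
# ChatML-formatted translation pairs.
```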