	---
	license: llama3.1
	language:
	- el
	- en
	pipeline_tag: text-generation
	library_name: transformers
	tags:
	- text-generation-inference
	---

	# Llama-Krikri-8B-Base: A large foundation Language Model for the Greek language

	Following the release of [Meltemi-7B](https://huggingface.co/ilsp/Meltemi-7B-v1) on 26 March 2024, we are happy to welcome Krikri to the family of ILSP open Greek LLMs.
	Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present Llama-Krikri-8B-Base, as well as an instruct version, [Llama-Krikri-8B-Instruct](https://huggingface.co/ilsp/Llama-Krikri-8B-instruct).

	![image/png](llama-krikri-image.jpg)

	# Model Information

	- Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens
	- 128k context length (approximately 80,000 Greek words)
	- We extend the pretraining of Llama-3.1-8B on a large training corpus to add proficiency in Greek.
	  * The corpus includes 56.7 billion monolingual Greek tokens, constructed from publicly available resources.
	  * Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use sub-corpora of monolingual English texts (21 billion tokens) and Greek-English parallel data (5.5 billion tokens).
	  * The training corpus also contains 7.8 billion math and code tokens.
	  * The corpus has been processed, filtered, and deduplicated to ensure data quality and is outlined below:


	| Sub-corpus | # Tokens | Percentage |
	|-----------|------------------|------------|
	| Greek | 56.7 B | 62.3 % |
	| English | 21.0 B | 23.1 % |
	| Parallel | 5.5 B | 6.0 % |
	| Math/Code | 7.8 B | 8.6 % |
	| Total | 91.0 B | 100.0 % |


	Selected subsets of the 91-billion-token corpus were upsampled, resulting in a final training set of 110 billion tokens.
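
As a quick sanity check on the figures above, the following sketch recomputes each sub-corpus share from its token count and derives the average upsampling factor implied by the 110-billion-token total. The variable names are illustrative, and the token counts are simply copied from the table.

```python
# Token counts in billions, copied from the table above.
subcorpora = {
    "Greek": 56.7,
    "English": 21.0,
    "Parallel": 5.5,
    "Math/Code": 7.8,
}

total = sum(subcorpora.values())  # about 91 billion tokens

# Share of each sub-corpus in the corpus before upsampling.
shares = {name: 100 * count / total for name, count in subcorpora.items()}
for name, share in shares.items():
    print(f"{name:<10} {share:5.1f} %")

# Ratio between the upsampled size (110 B) and the original size (91 B).
upsampled_total = 110.0
upsampling_factor = upsampled_total / total
print(f"Average upsampling factor: {upsampling_factor:.2f}x")
```

The factor is only an average over the whole corpus; the figures above do not say which subsets were upsampled or by how much.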


	# How to use

	## With Transformers

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	device = "cuda"  # change to "cpu" if no GPU is available

	# Load the base model and its Greek-extended tokenizer from the Hugging Face Hub
	model = AutoModelForCausalLM.from_pretrained("ilsp/Llama-Krikri-8B-Base")
	tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")

	model.to(device)

	# Prompt: "A kri-kri differs from a llama because"
	inputs = tokenizer("Ένα κρικρί διαφέρει από ένα λάμα επειδή", return_tensors="pt").to(device)
	# Pass input_ids and attention_mask together; sample up to 256 new tokens
	outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True)

	print(tokenizer.batch_decode(outputs)[0])
	```

	## With an OpenAI-compatible server via vLLM

	```bash
	vllm serve ilsp/Llama-Krikri-8B-Base \
	--enforce-eager \
	--dtype 'bfloat16' \
	--api-key token-abc123
	```

	The model can then be queried from Python using the `openai` client:
	```python
	from openai import OpenAI

	# Must match the --api-key passed to `vllm serve`
	api_key = "token-abc123"
	# Default address of a local vLLM server
	base_url = "http://localhost:8000/v1"

	client = OpenAI(
	    api_key=api_key,
	    base_url=base_url,
	)

	# Prompt: "Training large language models involves"
	response = client.completions.create(
	    model="ilsp/Llama-Krikri-8B-Base",
	    prompt="Η εκπαίδευση μεγάλων γλωσσικών μοντέλων περιλαμβάνει",
	)
	print(response.choices[0].text)
	```

	# Evaluation

	Below, we report the average improvements of Llama-Krikri-8B-Base over Llama-3.1-8B for Greek and English:
	- +10.8 percentage points on Greek benchmarks (48.7% → 59.5%)
	- +0.8 percentage points on English benchmarks (66.2% → 67.0%)

	Our evaluations for Llama-Krikri-8B-Base, Llama-3.1-8B, and Meltemi 7B v1.5 are performed in a few-shot setting, consistent with the settings in the [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
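
To illustrate what a few-shot setting means for a base model, here is a minimal sketch that prepends k solved examples to an unsolved question. The prompt template, the `format_example` and `build_few_shot_prompt` helpers, and the example questions are all hypothetical; they are not the templates used by lighteval or the Open LLM leaderboard.

```python
def format_example(question, choices, answer=None):
    """Render one multiple-choice item; leave the answer blank if unknown."""
    letters = "ABCDEFGH"
    lines = [f"Question: {question}"]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:" if answer is None else f"Answer: {letters[answer]}")
    return "\n".join(lines)


def build_few_shot_prompt(shots, question, choices):
    """Concatenate k solved examples followed by the unsolved target item."""
    blocks = [format_example(q, c, a) for q, c, a in shots]
    blocks.append(format_example(question, choices))
    return "\n\n".join(blocks)


# Two made-up demonstrations (a 2-shot prompt); answers are choice indices.
shots = [
    ("What is the capital of Greece?", ["Athens", "Rome", "Madrid"], 0),
    ("How many days are in a week?", ["Five", "Seven", "Ten"], 1),
]
prompt = build_few_shot_prompt(
    shots,
    "Which sea lies between Greece and Turkey?",
    ["Aegean Sea", "Baltic Sea", "Red Sea"],
)
print(prompt)
```

The resulting string can be passed as the prompt in either of the usage snippets in the "How to use" section.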

	## Greek Benchmarks


	The evaluation suite we created for the Greek language includes 6 test sets. You can run the suite by cloning this [lighteval fork](https://github.com/LeonVouk/lighteval).

	Our evaluation suite includes:
	* Four machine-translated versions ([ARC Greek](https://huggingface.co/datasets/ilsp/arc_greek), [Truthful QA Greek](https://huggingface.co/datasets/ilsp/truthful_qa_greek), [HellaSwag Greek](https://huggingface.co/datasets/ilsp/hellaswag_greek), [MMLU Greek](https://huggingface.co/datasets/ilsp/mmlu_greek)) of established English benchmarks for language understanding and reasoning ([ARC Challenge](https://arxiv.org/abs/1803.05457), [Truthful QA](https://arxiv.org/abs/2109.07958), [Hellaswag](https://arxiv.org/abs/1905.07830), [MMLU](https://arxiv.org/abs/2009.03300)).
	* An existing benchmark for question answering in Greek ([Belebele](https://arxiv.org/abs/2308.16884))
	* A novel benchmark created by the ILSP team for medical question answering based on the medical exams of [DOATAP](https://www.doatap.gr) ([Medical MCQA](https://huggingface.co/datasets/ilsp/medical_mcqa_greek)).

	Our training improves performance on every Greek test set, raising the average score by 10.8 percentage points over Llama-3.1-8B. The results for the Greek test sets are shown in the following table:

	| | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
	|----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------|
	| Meltemi 7B v1.5 | 42.2% | 61.0% | 53.8% | 40.0% | 49.0% | 41.2% | 47.9% |
	| Llama-3.1-8B | 33.4% | 72.8% | 52.1% | 39.9% | 51.1% | 42.6% | 48.7% |
	| Llama-Krikri-8B | 53.8% | 82.7% | 64.6% | 49.4% | 54.2% | 52.0% | 59.5% |
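
The per-benchmark gains behind the Greek average can be recomputed directly from the table. The sketch below is illustrative only: the scores are copied from the Llama-3.1-8B and Llama-Krikri-8B rows above, and the variable names are chosen for this example.

```python
benchmarks = [
    "Medical MCQA", "Belebele", "HellaSwag",
    "ARC-Challenge", "TruthfulQA MC2", "MMLU",
]

# Greek scores (%) copied from the table above.
llama_31_8b = [33.4, 72.8, 52.1, 39.9, 51.1, 42.6]
llama_krikri_8b = [53.8, 82.7, 64.6, 49.4, 54.2, 52.0]

# Gain in percentage points (pp) for each benchmark.
gains = {
    name: krikri - base
    for name, base, krikri in zip(benchmarks, llama_31_8b, llama_krikri_8b)
}
for name, gain in gains.items():
    print(f"{name:<15} {gain:+5.1f} pp")

mean_gain = sum(gains.values()) / len(gains)
print(f"Mean gain: {mean_gain:+.1f} pp")
```

The largest gain is on Medical MCQA (+20.4 points) and the smallest on TruthfulQA MC2 (+3.1 points).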


	## English Benchmarks

	We can also see that our training methodology not only mitigates catastrophic forgetting effectively, but also improves the average score over the English test sets by 0.8 percentage points relative to Llama-3.1-8B. The results for the English test sets are shown in the following table:

	| | Winogrande (5-shot) | Belebele (5-shot) | HellaSwag (10-shot) | ARC-Challenge (25-shot) | TruthfulQA MC2 (0-shot) | MMLU (5-shot) | Average |
	|----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------|
	| Meltemi 7B v1.5 | 73.4% | 77.7% | 79.6% | 54.1% | 40.5% | 56.9% | 63.7% |
	| Llama-3.1-8B | 74.6% | 71.5% | 82.0% | 58.5% | 44.2% | 66.2% | 66.2% |
	| Llama-Krikri-8B | 72.6% | 79.8% | 80.7% | 57.8% | 44.8% | 65.1% | 67.0% |

	Please note that all evaluations were run with the latest version of lighteval, which differs in some respects from past versions. This is why the scores we report for Meltemi-7B-v1.5 differ from previously published ones.


	# Ethical Considerations

	This model has not been aligned with human preferences and may therefore generate misleading, harmful, or toxic content.


	# Acknowledgements

	The ILSP team utilized Amazon's cloud computing services, which were made available via GRNET under the [OCRE Cloud framework](https://www.ocre-project.eu/), providing Amazon Web Services for the Greek Academic and Research Community.