|
---
license: apache-2.0
---
|
# Introduction
|
|
|
|
|
# Eval |
|
Dev evaluation on CS-HellaSwag (an automatically translated HellaSwag benchmark):
|
| Model              | Accuracy   |
|--------------------|------------|
| mistral7b          | 0.4992     |
| csmpt-130k steps   | __0.5004__ |
| csmpt-100k steps   | 0.4959     |
| csmpt-75k steps    | 0.4895     |
| csmpt-50k steps    | 0.4755     |
| csmpt-26.5k steps  | 0.4524     |
|
|
|
|
|
However, we ran validation on CS-HellaSwag over the course of training, and after 100k steps the improvements, if any, were very noisy. The improvement over mistral7b is not significant.
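For reference, the sketch below shows one common way to score a HellaSwag-style benchmark with a causal LM: each candidate ending is ranked by the log-likelihood the model assigns to it given the context, and the highest-scoring ending is compared to the gold label. The local file name `cs_hellaswag_dev.jsonl` and the `ctx`/`endings`/`label` field names are assumptions (they follow the original HellaSwag format), and this is not necessarily the exact protocol behind the numbers above.

```python
import json

import torch
import transformers

# Hypothetical local copy of the translated benchmark in HellaSwag-style
# JSON-lines format (fields: ctx, endings, label) -- adjust to your data source.
examples = [json.loads(line) for line in open('cs_hellaswag_dev.jsonl', encoding='utf-8')]

name = 'BUT-FIT/csmpt7b'
tokenizer = transformers.AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = transformers.AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, trust_remote_code=True).to('cuda:0').eval()


@torch.no_grad()
def ending_logprob(ctx: str, ending: str) -> float:
    """Sum of log-probabilities the model assigns to the ending tokens given the context."""
    ctx_len = tokenizer(ctx, return_tensors='pt').input_ids.shape[1]
    full_ids = tokenizer(ctx + ending, return_tensors='pt').input_ids.to('cuda:0')
    logits = model(input_ids=full_ids).logits.float()
    # The logits at position t predict the token at position t + 1.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    return sum(logprobs[i, targets[i]].item() for i in range(ctx_len - 1, targets.shape[0]))


correct = 0
for ex in examples:
    scores = [ending_logprob(ex['ctx'], ' ' + e) for e in ex['endings']]
    correct += int(max(range(len(scores)), key=scores.__getitem__) == int(ex['label']))
print(f'CS-HellaSwag accuracy: {correct / len(examples):.4f}')
```

Length-normalizing the ending scores (dividing by the number of ending tokens) is another common scoring variant and can shift results slightly.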
|
|
|
## Loss |
|
tbd. |
|
|
|
|
|
## Training Method |
|
tbd. |
|
|
|
|
|
# Usage |
|
## How to Setup Environment |
|
```bash |
|
pip install transformers==4.37.2 torch==2.1.2 einops==0.7.0

# Be sure to install the right flash-attn build; we use torch compiled with CUDA 12.1, no ABI, Python 3.9, Linux x86_64 architecture.
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.3/flash_attn-2.5.3+cu122torch2.1cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
|
``` |
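
After installation, a quick sanity check along the following lines can confirm that the CUDA build of torch and the flash-attn wheel import correctly. This is a minimal sketch; the printed versions will depend on your environment.

```python
import torch
import transformers
import einops
import flash_attn

# The flash-attn wheel above is CUDA-specific, so also confirm a GPU is visible.
print('torch', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('transformers', transformers.__version__)
print('einops', einops.__version__)
print('flash_attn', flash_attn.__version__)
```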
|
|
|
## Running the Code |
|
```python |
|
import torch
import transformers
from transformers import pipeline

name = 'BUT-FIT/csmpt7b'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.init_device = 'cuda:0'  # For fast initialization directly on GPU!
model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.bfloat16,  # Load model weights in bfloat16
    trust_remote_code=True
)

tokenizer = transformers.AutoTokenizer.from_pretrained(name, trust_remote_code=True)

pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')

with torch.autocast('cuda', dtype=torch.bfloat16):
    print(
        pipe('Nejznámějším českým spisovatelem ',  # Czech prompt: "The best-known Czech writer "
             max_new_tokens=100,
             top_p=0.95,
             repetition_penalty=1.0,
             do_sample=True,
             use_cache=True))
|
|
|
``` |
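
If you prefer not to go through the `pipeline` wrapper, the same generation can be done with `tokenizer` and `model.generate` directly. This is a minimal sketch assuming `model` and `tokenizer` are already loaded as above; the sampling parameters mirror the pipeline call.

```python
# Direct generation without the pipeline wrapper; reuses `model` and `tokenizer`
# loaded in the snippet above.
inputs = tokenizer('Nejznámějším českým spisovatelem ', return_tensors='pt').to('cuda:0')

with torch.autocast('cuda', dtype=torch.bfloat16):
    output_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        top_p=0.95,
        repetition_penalty=1.0,
        use_cache=True,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```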
|
# Training Data |
|
We release most of our training data here \[TBD MDocekal.\]. |
|
|
|
|
|
# Our Release Plan |
|
| Stage | Description | Date |
|-------|-------------|------|
| 1 | 'Best' model + training data | 11.03.2024 |
| 2 | All checkpoints + training code | |
| 3 | __Benczechmark__, a collection of Czech datasets for few-shot LLM evaluation. **Get in touch if you want to contribute!** | |
| 4 | Preprint publication | |
|
|
|
## Getting in Touch |
|
For further questions, email `martin.fajcik@vut.cz`.
|
|
|
# Disclaimer |
|
This is a probabilistic model, and the authors are not responsible for its outputs. Use at your own risk.
|
|
|
|
|
# Acknowledgement |
|
This work was supported by the NAKI III program of the Ministry of Culture of the Czech Republic, project semANT ("Sémantický průzkumník textového kulturního dědictví", i.e. "Semantic explorer of textual cultural heritage"), grant no. `DH23P03OVV060`, and by the Ministry of Education, Youth and Sports of the Czech Republic through e-INFRA CZ (ID: `90254`).