---
license: cc-by-sa-4.0
language:
  - pl
---

# Model Card for Krakowiak-v2-7b

Krakowiak-v2-7b is a state-of-the-art 7.3-billion-parameter LLM based on Mistral-7B. It was fine-tuned for Polish text generation on a custom-built corpus of 300K Polish instructions. It uses techniques such as LoRA and adding noise to the embeddings for greater performance. For full details of this model, please read our paper, to be released soon.

## Model Architecture

Krakowiak-v2-7b is a major update to Krakowiak-7b, bringing the following improvements:

- Grouped-Query Attention
- Sliding-Window Attention
- Byte-fallback BPE tokenizer
- Significantly larger training corpus of better quality
- Improved training pipeline
- Faster inference
- No more random token generation (e.g. Russian or Czech text appearing alongside Polish)
- Significantly higher quality of generated text

Krakowiak does not have any censorship mechanisms implemented.

## Instruction format

To leverage instruction fine-tuning, your prompt should be surrounded by [INST] and [/INST] tokens. The very first instruction should begin with the begin-of-sentence (BOS) token id; subsequent instructions should not. The assistant's generation is terminated by the end-of-sentence (EOS) token id.

E.g.

```
text = "<s>[INST] Czy warto się uczyć? [/INST]"
"Jako model języka AI nie mam osobistych opinii ani przekonań. Jednak uczanie się jest ogólnie cenione i angażuje długoterminowe korzyści dla jednostki i społeczeństwa. Dzięki nauce można opracować umiejętności, zdobąć wiedzę i zwiększyć potencjał osobisty i zawodowy. Ponadto nauka promuje otwartość na innowacje i może zapewnić lepszą perspektywę życia. Dlatego uczanie się jest ogólnie cenione i wartościowe dla osób, które chcą osiągnąć cel lub rozwinąć swoją wiedzę. Jednak każda sytuacja może być inna i niektóre osoby mogą mieć różne cele lub okoliczności, które wpływają na ich podejście do nauki. Ostatecznie zależy to od indywidualnych preferencji i celów.</s> "
"[INST] Gdzie warto się uczyć? [/INST]"
```

(The first prompt asks "Is it worth learning?"; the follow-up, "Where is it worth studying?".)

## Model parameters

In my experience, a temperature of 0.7 is the best baseline value.
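As a small sketch of sampling with that baseline value (assuming `model`, `tokenizer`, and `model_inputs` are set up as in the next section; the `top_p` value is my own common-default assumption, not a repository setting):

```python
sampled = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,    # enable sampling; temperature has no effect otherwise
    temperature=0.7,   # baseline value suggested above
    top_p=0.95,        # assumption: a common nucleus-sampling default
)
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```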

## Optimal text generation

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# Krakowiak-7B-v2 follows the Mistral-7B-Instruct chat template.
chat_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
messages = [
    {"role": "user", "content": "Czy warto nauczyć się jeździć na nartach w wieku 25 lat?"},
]
chat_prompt = chat_tokenizer.apply_chat_template(messages, tokenize=False)

model = AutoModelForCausalLM.from_pretrained("szymonrucinski/Krakowiak-7B-v2").to(torch_device)
model.config.pad_token_id = model.config.eos_token_id

tokenizer = AutoTokenizer.from_pretrained("szymonrucinski/Krakowiak-7B-v2", add_eos_token=True)
tokenizer.pad_token = tokenizer.eos_token
model_inputs = tokenizer(chat_prompt, return_tensors="pt").to(torch_device)

# Beam search with n-gram repetition blocking.
beam_outputs = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=1,
    early_stopping=True,
)
print(tokenizer.decode(beam_outputs[0], skip_special_tokens=True))
```

## Use a pipeline as a high-level helper

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="szymonrucinski/krakowiak-v2-7b")

pipe("<s>[INST] Też lubisz jeździć na rowerze? [/INST]")
```

## Demo

You can play with Krakowiak-v2-7b here. The demo uses 4-bit quantization and CPU inference, which reduce output quality but are more cost-effective. You can also run Krakowiak on your own CPU using its quantized version, available here.
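As a sketch only: if the quantized build is distributed as a GGUF file, one common way to run it on CPU is llama-cpp-python. The file name below is hypothetical; substitute whatever the quantized repository actually ships.

```python
from llama_cpp import Llama

# Hypothetical file name; replace with the actual quantized model file.
llm = Llama(model_path="krakowiak-7b-v2.Q4_K_M.gguf", n_ctx=2048)

output = llm(
    "<s>[INST] Czy warto się uczyć? [/INST]",
    max_tokens=256,
    temperature=0.7,  # baseline value suggested above
)
print(output["choices"][0]["text"])
```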

## Krakowiak team

Szymon Franciszek Ruciński

## Citation

If you find the content of this repo useful in your work, please cite it as follows:

```bibtex
@misc{Krakowiak-V2-7B,
  author = {Szymon Ruciński},
  title = {Krakowiak-V2-7B},
  year = {2023},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/szymonrucinski/krakowiak-v2-7b/}}
}
```