---
base_model: None
tags:
- generated_from_trainer
model-index:
- name: checkpoints-mistral-300M-FA2
  results: []
---

# japanese-mistral-300m-base

## Overview

Welcome to my model card!

The main features of this model are:

- Suppression of unknown-word generation, by using byte fallback in the SentencePiece tokenizer and converting it to the Hugging Face Tokenizers format (a hedged sketch of this setup appears at the end of this card)
- Pretraining on the Wikipedia and CC-100 datasets
- Use of a 300M-parameter [Mistral architecture](https://huggingface.co/ce-lery/japanese-mistral-300m-base/blob/main/config.json)

Take it easy!

## How to use the model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import torch

MODEL_NAME = "ce-lery/japanese-mistral-300m-base"
torch.set_float32_matmul_precision("high")

# Use a GPU if one is available, otherwise fall back to the CPU.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(DEVICE)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
).to(DEVICE)

# Uncomment to stream tokens to stdout as they are generated:
# streamer = TextStreamer(tokenizer)

prompt = "大規模言語モデルとは、"
inputs = tokenizer(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        inputs["input_ids"],
        max_new_tokens=256,
        do_sample=True,
        early_stopping=False,
        top_p=0.95,
        top_k=50,
        temperature=0.9,
        # streamer=streamer,
        no_repeat_ngram_size=2,
        num_beams=3,
    )

# Print the generated token IDs, then the decoded text.
print(outputs.tolist()[0])
outputs_txt = tokenizer.decode(outputs[0])
print(outputs_txt)
```

## Recipe

If you want to rebuild this model, please refer to [this GitHub repository](https://github.com/ce-lery/japanese-mistral-300m-recipe).
It contains the recipe I used to build this model, covering:

- Preprocessing with SentencePiece
- Pretraining with FlashAttention-2, torch.compile, and DeepSpeed
- Fine-tuning with databricks-dolly-15k-ja

If you find any mistakes or errors, please open an issue.
Pull requests are also very welcome!

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a hedged `TrainingArguments` sketch based on these values appears at the end of this card):

- learning_rate: 0.0006
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 64
- total_train_batch_size: 256
- optimizer: Adam with betas=(0.9,0.95) and epsilon=0.0001
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 1000
- num_epochs: 1
- mixed_precision_training: Native AMP

### Training results

| Training Loss | Epoch | Step  | Validation Loss |
|:-------------:|:-----:|:-----:|:---------------:|
| 4.2911        | 0.12  | 5000  | 4.2914          |
| 3.9709        | 0.24  | 10000 | 3.9900          |
| 3.8229        | 0.36  | 15000 | 3.8388          |
| 3.7197        | 0.47  | 20000 | 3.7454          |
| 3.652         | 0.59  | 25000 | 3.6739          |
| 3.597         | 0.71  | 30000 | 3.6177          |
| 3.5554        | 0.83  | 35000 | 3.5770          |
| 3.536         | 0.95  | 40000 | 3.5582          |

### Framework versions

- Transformers 4.35.2
- Pytorch 2.1.1+cu121
- Datasets 2.14.5
- Tokenizers 0.14.1
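
For orientation, here is a minimal sketch of how the hyperparameters listed under "Training hyperparameters" above might map onto `transformers.TrainingArguments`. This is an assumption on my part, not the exact configuration of the actual run; the real setup (including the DeepSpeed, FlashAttention-2, and torch.compile pieces) lives in the recipe repository, and the `output_dir` name and 5000-step evaluation cadence are simply taken from the checkpoint name and the results table above.

```python
# Hypothetical mapping only: mirrors the hyperparameters listed above, but is not
# the configuration actually used. DeepSpeed, FlashAttention-2, and torch.compile
# (used in the real run) are omitted here.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="checkpoints-mistral-300M-FA2",  # name taken from the model-index entry
    learning_rate=6e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=64,   # 4 per device x 64 steps (x GPUs) -> total batch size 256
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_steps=1000,
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_epsilon=1e-4,
    seed=42,
    fp16=True,                        # "Native AMP" mixed precision (bf16 is a plausible alternative)
    evaluation_strategy="steps",
    eval_steps=5000,                  # matches the 5000-step cadence of the results table
    logging_steps=5000,
)
```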
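
The byte-fallback behaviour mentioned in the Overview can be illustrated with plain SentencePiece. The sketch below is a minimal, hypothetical example: the corpus path, vocabulary size, and the `LlamaTokenizerFast` conversion route are my own placeholder choices, not necessarily what the recipe repository does.

```python
# Hypothetical sketch only: the corpus path and vocabulary size are placeholders,
# not the settings used for this model (see the recipe repository for those).
import sentencepiece as spm
from transformers import LlamaTokenizerFast

# 1. Train a SentencePiece model with byte fallback, so that characters outside the
#    learned vocabulary decompose into byte pieces instead of <unk>.
spm.SentencePieceTrainer.train(
    input="corpus.txt",            # plain-text training corpus (placeholder path)
    model_prefix="spm_tokenizer",
    vocab_size=8000,               # illustrative value only
    model_type="unigram",
    byte_fallback=True,
    character_coverage=0.9995,     # a common setting for Japanese text
)

# 2. One possible way to convert the result to the Hugging Face Tokenizers format:
#    wrap the .model file in a fast tokenizer class that supports byte fallback.
#    (The recipe repository may do this step differently.)
hf_tokenizer = LlamaTokenizerFast(vocab_file="spm_tokenizer.model")
hf_tokenizer.save_pretrained("hf_tokenizer")

# A rare character such as an emoji is split into byte pieces (e.g. <0xF0>),
# so no <unk> token is produced.
print(hf_tokenizer.tokenize("未知の文字🦕も分割できます"))
```

Because every possible byte has its own token, the tokenizer never emits `<unk>`, which is what suppresses unknown-word generation in the model's output.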