---
library_name: transformers
datasets:
- fanlino/lol-champion-qa
language:
- ko
base_model:
- google/gemma-2-2b-it
---

# Model Card for Gemma-2-2b-it LoL Champion Q&A

## Model Details

### Model Description

This model is a fine-tuned version of google/gemma-2-2b-it, designed to answer questions about champions from the online game League of Legends. Fine-tuned on a custom dataset of champion stories and lore, the model generates its responses in Korean.

- **Developed by:** Dohyun Kim, Jongbong Lee, Jaehoon Kim
- **Model type:** Fine-tuned large language model
- **Language(s) (NLP):** Korean
- **Finetuned from model:** google/gemma-2-2b-it

## Training Details

### Training Data

The dataset was created by scraping champion lore from the official League of Legends Universe website and transforming the content into Q&A format using large language models. You can find the dataset at [fanlino/lol-champion-qa](https://huggingface.co/datasets/fanlino/lol-champion-qa).

```python
import csv

import requests
from bs4 import BeautifulSoup

# List of champion URL slugs used by the Universe site
champions = [
    "aatrox", "ahri", "akali", "akshan", "alistar", "amumu", "anivia", "annie",
    "aphelios", "ashe", "aurelionsol", "azir", "bard", "belveth", "blitzcrank",
    "brand", "braum", "caitlyn", "camille", "cassiopeia", "chogath", "corki",
    "darius", "diana", "drmundo", "draven", "ekko", "elise", "evelynn", "ezreal",
    "fiddlesticks", "fiora", "fizz", "galio", "gangplank", "garen", "gnar",
    "gragas", "graves", "gwen", "hecarim", "heimerdinger", "illaoi", "irelia",
    "ivern", "janna", "jarvaniv", "jax", "jayce", "jhin", "jinx", "kaisa",
    "kalista", "karma", "karthus", "kassadin", "katarina", "kayle", "kayn",
    "kennen", "khazix", "kindred", "kled", "kogmaw", "leblanc", "leesin",
    "leona", "lillia", "lissandra", "lucian", "lulu", "lux", "malphite",
    "malzahar", "maokai", "masteryi", "milio", "missfortune", "mordekaiser",
    "morgana", "naafiri", "nami", "nasus", "nautilus", "neeko", "nidalee",
    "nilah", "nocturne", "nunu", "olaf", "orianna", "ornn", "pantheon",
    "poppy", "pyke", "qiyana", "quinn", "rakan", "rammus", "reksai", "rell",
    "renataglasc", "renekton", "rengar", "riven", "rumble", "ryze", "samira",
    "sejuani", "senna", "seraphine", "sett", "shaco", "shen", "shyvana",
    "singed", "sion", "sivir", "skarner", "sona", "soraka", "swain", "sylas",
    "syndra", "tahmkench", "taliyah", "talon", "taric", "teemo", "thresh",
    "tristana", "trundle", "tryndamere", "twistedfate", "twitch", "udyr",
    "urgot", "varus", "vayne", "veigar", "velkoz", "vex", "vi", "viego",
    "viktor", "vladimir", "volibear", "warwick", "monkeyking", "xayah",
    "xerath", "xinzhao", "yasuo", "yone", "yorick", "yuumi", "zac", "zed",
    "ziggs", "zilean", "zoe", "zyra"
]
print(f"The total number of champions: {len(champions)}")

# Base URL for the champion story pages in Korean
base_url = "https://universe.leagueoflegends.com/ko_KR/story/champion/"

# Scrape the Korean name and background story of a champion
def scrape_champion_data(champion):
    url = base_url + champion + "/"
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extract the Korean name from the <title> tag
        korean_name = soup.find('title').text.split('-')[0].strip()
        # Extract the background story from the meta description
        meta_description = soup.find('meta', {'name': 'description'})
        if meta_description:
            background_story = meta_description.get('content').replace('\n', ' ').strip()
        else:
            background_story = "No background story available"
        return korean_name, background_story
    else:
        return None, None

# Open the CSV file for writing
with open("champion_bs.csv", "w", newline='', encoding='utf-8') as csvfile:
    # Define the column headers
    fieldnames = ['url-name', 'korean-name', 'background-story']

    # Create a CSV writer object
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    # Write the header
    writer.writeheader()

    # Scrape data for each champion and write to CSV
    for champion in champions:
        korean_name, background_story = scrape_champion_data(champion)
        if korean_name and background_story:
            writer.writerow({
                'url-name': champion,
                'korean-name': korean_name,
                'background-story': background_story
            })
            print(f"Scraped data for {champion}: {korean_name}")
        else:
            print(f"Failed to scrape data for {champion}")

print("Data scraping complete. Saved to champion_bs.csv")
```
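If you just want the finished data, the Q&A pairs can be pulled straight from the Hub instead of re-running the scraper. A minimal sketch, assuming the published dataset exposes a `train` split with the same `q`/`a` columns used by the training code below:

```python
from datasets import load_dataset

# Load the published Q&A dataset from the Hugging Face Hub
# (the split name and q/a columns are assumptions; check the dataset card)
qa = load_dataset("fanlino/lol-champion-qa", split="train")
print(qa[0]["q"])
print(qa[0]["a"])
```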
### Training Procedure

**Environment Setup**

The model was fine-tuned with QLoRA: the base weights are loaded in 4-bit NF4 quantization (via bitsandbytes through transformers) to reduce memory usage and compute cost, and LoRA (Low-Rank Adaptation) adapters are attached to specific linear layers so that only a small set of parameters is trained.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "google/gemma-2-2b-it"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="eager"  # eager attention is recommended for Gemma-2
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
```

**QLoRA Setting**

```python
import bitsandbytes as bnb
from peft import LoraConfig, get_peft_model

# Collect the names of all 4-bit linear layers (except lm_head) as LoRA targets
def find_linear_layers(model):
    linear_layers = set()
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split('.')
            layer_name = names[-1]
            if layer_name != 'lm_head':
                linear_layers.add(layer_name)
    return list(linear_layers)

lora_target_modules = find_linear_layers(model)

lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    target_modules=lora_target_modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
```
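As a quick sanity check after wrapping the model, PEFT's `print_trainable_parameters()` shows how small the trainable LoRA footprint is compared to the full 2B-parameter base model:

```python
# Prints a one-line summary of the form:
# trainable params: ... || all params: ... || trainable%: ...
model.print_trainable_parameters()
```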
**Loading Training Datasets**

To prepare the training data, the champion stories were converted into a question-answer format. The dataset was then structured with a chat-style template to match the Gemma-2 conversation format.

```python
import pandas as pd
from datasets import Dataset

data = [
    {
        "q": "대부분의 필멸자가 알고 있는 현실 차원은 무엇인가?",
        "a": "대부분의 필멸자는 물질 세계라는 하나의 현실 차원만 알고 있다."
    },
    {
        "q": "오로라가 유년 시절을 보낸 곳은 어디인가?",
        "a": "오로라는 브뤼니 부족의 고향이자 외딴 마을인 아무우에서 유년 시절을 보냈다."
    },
    {
        "q": "오로라가 자신을 이해해준 유일한 가족 구성원은 누구인가?",
        "a": "오로라의 이모할머니 하부우가 오로라를 진심으로 받아들였다."
    },
    ...
]

qa_df = pd.DataFrame(data, columns=["q", "a"])
dataset = Dataset.from_pandas(qa_df)
```

We use Gemma-2's chat format template:

```
<start_of_turn>user
{Question}<end_of_turn>
<start_of_turn>model
{Answer}<end_of_turn>
```

And we write a function to structure the dataset:

```python
def format_chat_prompt(example):
    chat_data = [
        {"role": "user", "content": example["q"]},
        {"role": "assistant", "content": example["a"]}
    ]
    example["text"] = tokenizer.apply_chat_template(chat_data, tokenize=False)
    return example

dataset = dataset.map(format_chat_prompt, num_proc=4)
```

The actual format results in the following text:

```
<bos><start_of_turn>user
아트록스가 태어난 곳은 어디인가?<end_of_turn>
<start_of_turn>model
아트록스는 슈리마에서 태어났다.<end_of_turn>
```

**Training Model**

The model was then trained with the `SFTTrainer` class from TRL, using a per-device batch size of 1, 10 gradient accumulation steps, and 10 epochs. The optimizer was `paged_adamw_32bit`.

```python
from transformers import TrainingArguments
from trl import SFTTrainer

# Training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_MODEL_PATH,
    per_device_train_batch_size=1,
    # steps_per_epoch = ceil(len(dataset) / (batch_size * gradient_accumulation_steps))
    gradient_accumulation_steps=10,
    num_train_epochs=10,
    learning_rate=2e-4,
    fp16=False,
    bf16=False,
    logging_steps=len(dataset) // 10,
    optim="paged_adamw_32bit",
    logging_dir="./logs",
    save_strategy="epoch",
    evaluation_strategy="no",
    do_eval=False,
    group_by_length=True,
    report_to="none"
)

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
    packing=False,
)

# Train the model
trainer.train()
```

**Testing Model**

We wrote a helper function that wraps a question in the chat format before generating a response.

```python
def generate_response(prompt, model, tokenizer, temperature=0.1):
    formatted_prompt = f"""<start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
"""
    # The Gemma tokenizer prepends <bos> automatically
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=temperature > 0,
        temperature=temperature
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=False)
```

**Question**

```python
prompt = "조이는 아우렐리온 솔한테 무슨 약속을 했어?"
response = generate_response(prompt, model, tokenizer)
print(response)
```

**Expected Answer**

```
조이는 아우렐리온 솔을 지키기 위해 할 수 있는 것은 무엇이든 해주리라 약속했다.
```

**Result (Fine-tuned Model)**

```
<bos><start_of_turn>user
조이는 아우렐리온 솔한테 무슨 약속을 했어?<end_of_turn>
<start_of_turn>model
조이는 아우렐리온 솔을 지키기 위해 할 수 있는 것은 무엇이든 해주리라 약속했다.<end_of_turn>
```

**Result (Base Model)**

```
<bos><start_of_turn>user
조이는 아우렐리온 솔한테 무슨 약속을 했어?<end_of_turn>
<start_of_turn>model
조이는 아우렐리온 솔한테 **무슨 약속을 했는지**에 대한 정보는 아직 알려지지 않았습니다.

조이는 아우렐리온 솔한테 약속을 했는지에 대한 이야기는 몇 가지 유행하는 밈과 관련된 것으로 보입니다.

* **아우렐리온 솔:** 이것은 2023년 1월에 출시된 아우렐리온 솔의 이름입니다.
* **조이:** 이것은 2023년 1월에 출시된 아우렐리온 솔의 이름입니다.

이러한 밈들은 흥미롭지만, 실제로 조이는 아우렐리온 솔한테 무슨 약속을 했는지에 대한 정확한 정보는 아직 알려지지 않았습니다.
<end_of_turn>
```

The fine-tuned model reproduces the lore answer exactly, while the base model hallucinates and fails to answer the question, highlighting the improvement made through fine-tuning.

#### Summary

The full code discussed above can be found at the following link: [lol_lore.ipynb](https://github.com/star-bits/mlb-gemma/blob/main/lol_lore.ipynb)
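For completeness, a minimal sketch of reloading the fine-tuned adapter for later inference. It assumes the adapter was saved to `OUTPUT_MODEL_PATH` (for example via `trainer.save_model()`); adjust the path to wherever your checkpoint actually landed:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager"  # eager attention is recommended for Gemma-2
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

# Attach the trained LoRA adapter (the path is an assumption; see above)
model = PeftModel.from_pretrained(base_model, OUTPUT_MODEL_PATH)
model.eval()

# Reuse the generate_response helper defined under "Testing Model"
print(generate_response("조이는 아우렐리온 솔한테 무슨 약속을 했어?", model, tokenizer))
```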