FinShibainu Model Card

λͺ¨λΈμ€ KRX LLM κ²½μ§„λŒ€νšŒ λ¦¬λ”λ³΄λ“œμ—μ„œ μš°μˆ˜μƒμ„ μˆ˜μƒν•œ shibainu24 λͺ¨λΈμž…λ‹ˆλ‹€. λͺ¨λΈμ€ 금육, νšŒκ³„ λ“± κΈˆμœ΅κ΄€λ ¨ 지식에 λŒ€ν•œ Text Generation을 μ œκ³΅ν•©λ‹ˆλ‹€.

The code for dataset collection and training is published in detail at https://github.com/aiqwe/FinShibainu.

Usage

See the examples at https://github.com/aiqwe/FinShibainu for an easy way to run inference. Most inference runs on a single GPU at the RTX 3090 level or above.

pip install vllm

from vllm import LLM, SamplingParams

inputs = [
    "μ™Έν™˜μ‹œμž₯μ—μ„œ 일본 엔화와 λ―Έκ΅­ λ‹¬λŸ¬μ˜ ν™˜μœ¨μ΄ 두 μ‹œμž₯μ—μ„œ μ•½κ°„μ˜ 차이λ₯Ό 보이고 μžˆλ‹€. μ΄λ•Œ λ¬΄μœ„ν—˜ 이읡을 μ–»κΈ° μœ„ν•œ μ μ ˆν•œ 거래 μ „λž΅μ€ 무엇인가?",
    "μ‹ μ£ΌμΈμˆ˜κΆŒλΆ€μ‚¬μ±„(BW)μ—μ„œ μ±„κΆŒμžκ°€ μ‹ μ£ΌμΈμˆ˜κΆŒμ„ ν–‰μ‚¬ν•˜μ§€ μ•Šμ„ 경우 μ–΄λ–€ 일이 λ°œμƒν•˜λŠ”κ°€?",
    "곡맀도(Short Selling)에 λŒ€ν•œ μ„€λͺ…μœΌλ‘œ μ˜³μ§€ μ•Šμ€ 것은 λ¬΄μ—‡μž…λ‹ˆκΉŒ?"
]

llm = LLM(model="aiqwe/FinShibainu", tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(inputs, sampling_params)
for o in outputs:
    print(o.prompt)
    print(o.outputs[0].text)
    print("*"*100)

Model Card

| Contents | Spec |
| --- | --- |
| Base model | Qwen2.5-7B-Instruct |
| dtype | bfloat16 |
| PEFT | LoRA (r=8, alpha=64) |
| Learning rate | 1e-5 (varies by further training) |
| LR scheduler | Cosine (warm-up: 0.05%) |
| Optimizer | AdamW |
| Distributed / efficient tuning | DeepSpeed v3, Flash Attention |
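The cosine schedule with warm-up can be sketched as a plain function. This is a minimal sketch, not the actual training code; the interpretation of 0.05% as a warm-up step ratio and the step counts are assumptions:

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-5, warmup_ratio=0.0005):
    """Linear warm-up over the first warmup_ratio of steps, then cosine decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp linearly from near zero up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```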

Dataset Card

Due to copyright constraints, some reference datasets are provided as links. The MCQA and QA datasets are published at https://huggingface.co/datasets/aiqwe/FinShibainu.
The https://github.com/aiqwe/FinShibainu repository also provides various utility functions, and the data-sourcing pipeline can be consulted there.

References

| Dataset | URL |
| --- | --- |
| Bank of Korea: 700 Economic and Financial Terms | Link |
| Financial accounting synthetic data | Self-produced |
| Financial supervision terminology dictionary | Link |
| web-text.synthetic.dataset-50k | Link |
| Knowledge economy terminology dictionary | Link |
| KRX occasional publications | Link |
| KRX regulations | Link |
| Securities guide for beginner investors | Link |
| Securities investment for youth | Link |
| Corporate business report disclosures | Link |
| Current-affairs economic terminology dictionary | Link |

MCQA

MCQA λ°μ΄ν„°λŠ” Referenceλ₯Ό 기반으둜 λ‹€μ§€μ„ λ‹€ν˜• 문제λ₯Ό μƒμ„±ν•œ λ°μ΄ν„°μ…‹μž…λ‹ˆλ‹€. λ¬Έμ œμ™€ λ‹΅ 뿐만 μ•„λ‹ˆλΌ Reasoning ν…μŠ€νŠΈκΉŒμ§€ μƒμ„±ν•˜μ—¬ ν•™μŠ΅μ— μΆ”κ°€ν•˜μ˜€μŠ΅λ‹ˆλ‹€.
ν•™μŠ΅μ— μ‚¬μš©λœ λ°μ΄ν„°λŠ” μ•½ 4.5만개 데이터셋이며, tiktoken의 o200k_base(gpt-4o, gpt-4o-mini Tokenizer)λ₯Ό κΈ°μ€€μœΌλ‘œ 총 2천만개의 ν† ν°μœΌλ‘œ ν•™μŠ΅λ˜μ—ˆμŠ΅λ‹ˆλ‹€.

| Dataset | Examples | Tokens |
| --- | --- | --- |
| Bank of Korea: 700 Economic and Financial Terms | 1,203 | 277,114 |
| Synthetic data from financial accounting tables of contents | 451 | 99,770 |
| Financial supervision terminology dictionary | 827 | 214,297 |
| hf_web_text_synthetic_dataset_50k | 25,461 | 7,563,529 |
| Knowledge economy terminology dictionary | 2,314 | 589,763 |
| KRX occasional publications | 1,183 | 230,148 |
| KRX regulations | 3,015 | 580,556 |
| Securities guide for beginner investors | 599 | 116,472 |
| Securities investment for youth | 408 | 77,037 |
| Corporate business report disclosures | 3,574 | 629,807 |
| Current-affairs economic terminology dictionary | 7,410 | 1,545,842 |
| Total | 46,445 | 19,998,931 |
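As an illustration of the record shape, an MCQA example with its reasoning text can be flattened into a single training string. The field names and layout here are hypothetical, not the published schema:

```python
def build_mcqa_text(question, choices, reasoning, answer):
    """Flatten one multiple-choice record (question, numbered choices,
    reasoning text, answer index) into a single training string."""
    lines = [question]
    lines += [f"{i}. {choice}" for i, choice in enumerate(choices, start=1)]
    lines.append(f"Reasoning: {reasoning}")
    lines.append(f"Answer: {answer}")
    return "\n".join(lines)

text = build_mcqa_text(
    "What is short selling?",
    ["Buying on margin", "Selling borrowed shares", "Buying a call option"],
    "Short selling means selling shares borrowed from a broker.",
    2,
)
```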

QA

QA λ°μ΄ν„°λŠ” Reference와 μ§ˆλ¬Έμ„ ν•¨κ»˜ Input으둜 λ°›μ•„ μƒμ„±ν•œ λ‹΅λ³€κ³Ό Reference 없이 μ§ˆλ¬Έλ§Œμ„ Input으둜 λ°›μ•„ μƒμ„±ν•œ λ‹΅λ³€ 2κ°€μ§€λ‘œ κ΅¬μ„±λ©λ‹ˆλ‹€.
Referenceλ₯Ό μ œκ³΅λ°›μœΌλ©΄ λͺ¨λΈμ€ 보닀 μ •ν™•ν•œ 닡변을 ν•˜μ§€λ§Œ λͺ¨λΈλ§Œμ˜ 지식이 μ œν•œλ˜μ–΄ 닡변이 쒀더 μ§§μ•„μ§€κ±°λ‚˜ 닀양성이 μ€„μ–΄λ“€κ²Œ λ©λ‹ˆλ‹€. 총 4.8만개의 데이터셋과 2μ–΅κ°œμ˜ ν† ν°μœΌλ‘œ ν•™μŠ΅λ˜μ—ˆμŠ΅λ‹ˆλ‹€.

| Dataset | Examples | Tokens |
| --- | --- | --- |
| Bank of Korea: 700 Economic and Financial Terms | 1,023 | 846,970 |
| Financial supervision terminology dictionary | 4,128 | 3,181,831 |
| Knowledge economy terminology dictionary | 6,526 | 5,311,890 |
| KRX occasional publications | 1,510 | 1,089,342 |
| KRX regulations | 4,858 | 3,587,059 |
| Corporate business report disclosures | 3,574 | 629,807 |
| Current-affairs economic terminology dictionary | 29,920 | 5,981,839 |
| Total | 47,965 | 199,998,931 |
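The two QA generation modes described above (reference-grounded vs. question-only) can be sketched as a prompt builder; the exact prompt wording is an assumption, not taken from the sourcing pipeline:

```python
def build_qa_prompt(question, reference=None):
    """Reference-grounded mode prepends the source passage;
    open mode sends the question alone."""
    if reference is None:
        return f"Question: {question}\nAnswer:"
    return f"Reference: {reference}\n\nQuestion: {question}\nAnswer:"
```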

Citation

@misc{jaylee2024finshibainu,
  author = {Jay Lee},
  title = {FinShibainu: Korean specified finance model},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  url = {https://github.com/aiqwe/FinShibainu}
}