KoBioMed-Llama-3.1-8B

Introduction

We introduce KoBioMed-Llama-3.1-8B, a bilingual (English and Korean) generative language model specialized in the biomedical domain, developed by ezCaretech. The model was continually pre-trained (CPT) on PubMed abstracts and their Korean translations. The training data went through extensive preprocessing, including cleansing, de-duplication, and quality filtering.

KoBioMed-Llama-3.1-8B achieves the highest mean score among the compared models across Korean and English biomedical benchmarks, leading on KMMLU, KorMedMCQA, and PubMedQA (see Evaluation). We hope this model will contribute to the biomedical and medical research community.

This repository contains an 8-billion-parameter generative language model with the following key features:

Developed by: AI Team, ezCaretech R&D Center
Language Support: English and Korean
Context Length: 8,192 tokens
Vocab Size: 12,800
License: llama3.1

Notice!

This is a pre-trained model. It will be a great starting point for post-training, such as instruction tuning.
This model was developed with support from the Korea Artificial Intelligence Industry Cluster Agency (AICA).
A post-trained version (instruction tuning, DPO) is currently in training and is scheduled for release in March 2025.

Evaluation

We evaluated KoBioMed-Llama-3.1-8B on several Korean and English biomedical benchmarks.

All benchmarks were run with EleutherAI/lm-evaluation-harness in a 5-shot setting.
The subsets used for the KMMLU and MMLU evaluations are listed below.
- KMMLU: 'kmmlu_direct_biology'
- MMLU: 'mmlu_college_biology', 'mmlu_clinical_knowledge', 'mmlu_anatomy', 'mmlu_college_medicine', 'mmlu_medical_genetics', 'mmlu_professional_medicine'
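
As a rough illustration of the setup described above, the following sketch assembles an lm-evaluation-harness command line for the listed KMMLU and MMLU subsets with 5 few-shot examples. The helper name build_lm_eval_command, the dtype setting, and the batch size are assumptions made for illustration; the exact options used for the reported numbers are not stated here, and the task names of the other benchmarks are not listed, so they are left out.

```python
import shlex

# Task names copied from the subsets listed above.
KMMLU_TASKS = ["kmmlu_direct_biology"]
MMLU_TASKS = [
    "mmlu_college_biology",
    "mmlu_clinical_knowledge",
    "mmlu_anatomy",
    "mmlu_college_medicine",
    "mmlu_medical_genetics",
    "mmlu_professional_medicine",
]


def build_lm_eval_command(repo, tasks, num_fewshot=5):
    """Return the argument list for one lm-evaluation-harness CLI run.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    return [
        "lm_eval",
        "--model", "hf",
        # Assumption: bfloat16 weights, matching the Quickstart snippet.
        "--model_args", f"pretrained={repo},dtype=bfloat16",
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
        # Assumption: the batch size is not stated in this document.
        "--batch_size", "auto",
    ]


cmd = build_lm_eval_command(
    "Lowenzahn/KoBioMed-Llama-3.1-8B", KMMLU_TASKS + MMLU_TASKS
)
# Print a shell-ready command that can be pasted into a terminal.
print(shlex.join(cmd))
```

The printed command can be run in an environment where lm-evaluation-harness is installed; building it as a list first keeps the task names in one place.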

Models	KMMLU	KorMedMCQA	MedMCQA	MMLU	PubMedQA	Mean
KoBioMed-Llama-3.1-8B	0.4010	0.5705	0.5367	0.6837	0.7800	0.5944
Llama-3.1-8B	0.3620	0.5105	0.5635	0.7159	0.7600	0.5824
Mistral-7B-v0.3	0.3130	0.3958	0.4927	0.6693	0.7740	0.5290
Llama-3-Open-Ko-8B	0.3340	0.4941	0.4743	0.6251	0.7320	0.5319
SOLAR-10.7B-v1.0	0.3200	0.5146	0.5075	0.7050	0.7760	0.5646
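
The Mean column is consistent with an unweighted average of the five benchmark scores. The sketch below recomputes it from the table values and reports which model leads each benchmark. The names SCORES, REPORTED_MEAN, and best_per_benchmark are illustrative and not part of any released evaluation code.

```python
BENCHMARKS = ["KMMLU", "KorMedMCQA", "MedMCQA", "MMLU", "PubMedQA"]

# Scores copied from the table above, in the order given by BENCHMARKS.
SCORES = {
    "KoBioMed-Llama-3.1-8B": [0.4010, 0.5705, 0.5367, 0.6837, 0.7800],
    "Llama-3.1-8B": [0.3620, 0.5105, 0.5635, 0.7159, 0.7600],
    "Mistral-7B-v0.3": [0.3130, 0.3958, 0.4927, 0.6693, 0.7740],
    "Llama-3-Open-Ko-8B": [0.3340, 0.4941, 0.4743, 0.6251, 0.7320],
    "SOLAR-10.7B-v1.0": [0.3200, 0.5146, 0.5075, 0.7050, 0.7760],
}

# Mean column as reported in the table.
REPORTED_MEAN = {
    "KoBioMed-Llama-3.1-8B": 0.5944,
    "Llama-3.1-8B": 0.5824,
    "Mistral-7B-v0.3": 0.5290,
    "Llama-3-Open-Ko-8B": 0.5319,
    "SOLAR-10.7B-v1.0": 0.5646,
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


def best_per_benchmark(scores):
    """Map each benchmark name to the model with the highest score on it."""
    return {
        bench: max(scores, key=lambda model: scores[model][i])
        for i, bench in enumerate(BENCHMARKS)
    }


for model, values in SCORES.items():
    print(f"{model}: computed {mean(values):.4f}, reported {REPORTED_MEAN[model]:.4f}")

for bench, model in best_per_benchmark(SCORES).items():
    print(f"{bench}: best = {model}")
```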

Quickstart

Here is a code snippet for model inference.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = 'Lowenzahn/KoBioMed-Llama-3.1-8B'

# Load model
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16, 
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(repo)

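# Inference
prompts = ["Machine learning is"]
# Move the input tensors to the device that holds the model
inputs = tokenizer(prompts, return_tensors="pt").to(model.device)
# do_sample=True so that top_p and temperature take effect;
# set it to False for greedy decoding, which ignores both
gen_kwargs = {"max_new_tokens": 1024, "top_p": 0.8, "temperature": 0.8, "do_sample": True, "repetition_penalty": 1.2}
# Passing **inputs also forwards the attention mask
output = model.generate(**inputs, **gen_kwargs)
output = tokenizer.decode(output[0], skip_special_tokens=True)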
print(output)
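
Because the context length is 8,192 tokens and the snippet above requests up to 1,024 new tokens, long prompts may need to be shortened before generation. The following is a minimal sketch of that arithmetic on plain lists of token IDs. The helper names prompt_budget and truncate_to_budget are hypothetical, and keeping the most recent tokens is an assumption about what is worth preserving, not a recommendation from the model authors.

```python
CONTEXT_LENGTH = 8192  # from the key features listed in this model card
MAX_NEW_TOKENS = 1024  # from gen_kwargs in the snippet above


def prompt_budget(context_length=CONTEXT_LENGTH, max_new_tokens=MAX_NEW_TOKENS):
    """Number of prompt tokens that still leaves room for generation."""
    if max_new_tokens >= context_length:
        raise ValueError("max_new_tokens must be smaller than the context length")
    return context_length - max_new_tokens


def truncate_to_budget(token_ids, budget):
    """Keep at most `budget` tokens, dropping the oldest ones first.

    Assumption: left truncation. Note that this also drops any
    beginning-of-sequence token at the start of the list.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    if len(token_ids) <= budget:
        return list(token_ids)
    return list(token_ids[-budget:])


budget = prompt_budget()
print(f"Prompt budget: {budget} tokens")
```

In practice, the list of token IDs would come from tokenizer(prompt)["input_ids"], and the truncated list would be converted back to a tensor before calling model.generate.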

Limitations

KoBioMed-Llama-3.1-8B demonstrates strong performance in the biomedical domain, but it can sometimes generate inappropriate responses. Although we made considerable efforts to exclude sensitive data and harmful, racially discriminatory, or biased content from the training data, issues may still arise. The text generated by KoBioMed-Llama-3.1-8B does not reflect the views of the ezCaretech R&D Center AI Team.

The model may generate responses containing biased information related to age, gender, or race.
The model may generate responses containing personal information, harmful content, or other inappropriate information.
Since the model does not reflect the most up-to-date information, its responses may be outdated or contradictory.
The model's performance may degrade on tasks unrelated to the biomedical and healthcare domains.
KoBioMed-Llama-3.1-8B can make mistakes. Critical information should be verified independently.

Training Data

This model was trained on preprocessed abstracts of papers published in PubMed from 2000 to 2023. The preprocessing includes the following steps:

Removal of URLs
Removal of HTML tags
Removal of reference citations
Removal of identifiable information
MinHash-based de-duplication
Removal of low-quality text using a scoring model
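
The actual preprocessing pipeline is not published here, so the following is only a sketch of how a few of the steps above could look: regex-based removal of URLs, HTML tags, and bracketed citations, followed by a small pure-Python MinHash de-duplication. All patterns, function names, the shingle size, the number of hash functions, and the similarity threshold are assumptions for illustration. Removal of identifiable information and scoring-model filtering are not covered.

```python
import hashlib
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Crude tag pattern: "<" or "</" followed by a letter, up to the next ">".
HTML_TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")
# Assumption: citations look like [1], [2,3] or [4-6]; other styles are ignored.
CITATION_RE = re.compile(r"\s*\[\d+(?:\s*[-,]\s*\d+)*\]")


def clean_abstract(text):
    """Strip URLs, HTML tags, and bracketed citations, then normalize spaces."""
    text = URL_RE.sub(" ", text)
    text = HTML_TAG_RE.sub(" ", text)
    text = CITATION_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def shingles(text, n=5):
    """Set of word n-grams; n=5 is an arbitrary illustrative choice."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text, num_perm=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    shingle_set = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a distinct salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(texts, threshold=0.8):
    """Keep a text only if it is not too similar to any text kept so far.

    Pairwise comparison is quadratic; large corpora usually add LSH banding.
    """
    kept, kept_signatures = [], []
    for text in texts:
        sig = minhash_signature(text)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_signatures):
            kept.append(text)
            kept_signatures.append(sig)
    return kept


raw = "Mutations in <i>BRCA1</i> raise risk [1,2]. See https://example.org/x"
print(clean_abstract(raw))
```

A typical use would be to apply clean_abstract to every abstract first and then pass the cleaned texts to deduplicate.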

License

This model is released under the llama3.1 license.

Supported by

This model was developed with support from the Korea Artificial Intelligence Industry Cluster Agency (AICA).

Contact

조형민(Hyeongmin Cho), hyeongmin0121@gmail.com
김인후(Inhu Kim), markaki72@gmail.com
이동형(Donghyoung Lee), abidan88@gmail.com
박달호(Dalho Park), dhpark@ezcaretech.com

Citation

KoBioMed-Llama-3.1-8B

@article{kobiomedllama,
  title={KoBioMed-Llama-3.1-8B},
  author={Hyeongmin Cho and Inhu Kim and Donghyoung Lee and Sanghwan Kim and Dalho Park and Inchul Kang and Kyul Kim and Jihoon Cho and Jongbeom Park},
  year={2025},
  url={https://huggingface.co/Lowenzahn/KoBioMed-Llama-3.1-8B}
}
