|
--- |
|
language: |
|
- ko |
|
license: apache-2.0 |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
# Gemma-Ko-2B
|
|
|
> Update @ 2024.05.10: First release of gemma-ko |
|
|
|
|
This model card corresponds to the 2B-it version of the **Gemma-Ko** model. |
|
|
|
**Resources and Technical Documentation**: |
|
|
|
* [Original Gemma-2b-it](https://huggingface.co/google/gemma-2b-it) |
|
|
|
**Citation** |
|
|
|
```bibtex |
|
@misc{gemma-summary-v01,
  author    = {frcp, nebchi, pepperonipizza},
  title     = {gemma-summary-v01},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/cpm-ai/gemma-ko-v01}
}
|
``` |
|
|
|
**Model Developers**: frcp, nebchi, pepperonipizza |
|
|
|
## Model Information |
|
|
|
The model was trained on a dataset of 363,000 Korean text samples.
|
|
|
### Description |
|
Compared with other open LLMs, it was trained on a substantially larger volume of Korean tokens, enabling it to generate high-quality Korean text.

It also reaches strong performance with less training data than comparable models.
|
|
|
|
|
#### Running the model on a single / multi GPU |
|
|
|
```python

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TextStreamer,
    pipeline,
)

tokenizer = AutoTokenizer.from_pretrained("cpm-ai/gemma-ko-v01")
model = AutoModelForCausalLM.from_pretrained("cpm-ai/gemma-ko-v01", device_map="auto")

# Stream generated tokens to stdout as they are produced.
streamer = TextStreamer(tokenizer, skip_prompt=True)

# Text-generation pipeline around the fine-tuned model.
pipe_finetuned = pipeline("text-generation", model=model, tokenizer=tokenizer)
|
|
|
# The text to summarize: a long Korean transcript of a live broadcast panel
# discussion on the meaning of sharing and the culture of giving, abridged
# here to a placeholder. Substitute any Korean document.
# The prefix "요약 할 문장 :" means "Text to summarize:".
prompt = """요약 할 문장 :
[<Korean document to summarize>]"""
|
formatted_prompt = f"Instruction: {prompt}\n output:" |
|
|
|
outputs = pipe_finetuned(
    formatted_prompt,
    do_sample=True,  # sampling must be enabled for temperature/top_k/top_p to take effect
    temperature=0.1,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    max_new_tokens=512,  # generation length cap; adjust as needed
    add_special_tokens=True,
    streamer=streamer,
)
|
|
|
print(outputs[0]["generated_text"][len(formatted_prompt):]) |
|
``` |
|
|
|
### Results

For the broadcast-discussion transcript described above, the model returns a structured Korean summary; an English rendering of a representative output:

```
Title: A discussion on the meaning and necessity of sharing

1. The meaning and importance of sharing
- Sharing means time spent with particular people on particular days.
- On some days in particular, including Christmas, sharing can take place with family and relatives.
- Sharing matters because it is time spent together with family and relatives.

2. The necessity of sharing
- Community spirit is lacking, and a culture of giving has not taken root.
- Awareness of sharing is insufficient, so a culture of giving has not become established.

3. The rationale for sharing
- What is needed is the awareness that sharing is not something taken from giving but something gained through it.
- Institutional mechanisms are needed to cultivate and spread the spirit of sharing.

4. Views on the spirit of sharing
- The lawyer argued that sharing is ultimately for one's own happiness.
- The view that what is given eventually comes back to the giver was also presented.
- The view that the benefit of giving returns when one later becomes a patient was also presented.
```
|
|
|
### Inputs and outputs |
|
|
|
* **Input:** Text string, such as a question, a prompt, or a document to be summarized. |
|
* **Output:** Generated Korean text in response to the input, such as an answer to a question or a summary of meeting minutes.
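
The card's examples use a plain instruction-style template. As a minimal sketch, the same template can be applied to a question, reusing the `pipe_finetuned` pipeline defined above; the template itself mirrors the summarization example and is an assumption, not a documented API:

```python
# Hypothetical question-answering usage; the "Instruction:/output:" template
# is assumed from the summarization example above.
question = "한국의 수도는 어디인가요?"  # "What is the capital of Korea?"
formatted = f"Instruction: {question}\n output:"

result = pipe_finetuned(
    formatted,
    do_sample=True,
    temperature=0.1,
    top_p=0.95,
    max_new_tokens=128,
)
print(result[0]["generated_text"][len(formatted):])
```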
|
|
|
### Software |
|
|
|
Training was done using QLoRA, i.e., LoRA adapters fine-tuned on top of a 4-bit-quantized base model.
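
The exact training configuration is not published in this card. As a rough sketch of the technique, a QLoRA setup with `peft` and `bitsandbytes` typically looks like the following, where the base checkpoint, rank, and target modules are illustrative placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model with 4-bit NF4 quantization (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",  # assumed base checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# Attach low-rank adapters; rank, alpha, and target modules are placeholders.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```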
|
|
|
|
|
## Usage and Limitations |
|
|
|
These models have certain limitations that users should be aware of. |
|
|
|
### Intended Usage |
|
|
|
Open Large Language Models (LLMs) have a wide range of applications across |
|
various industries and domains. The following list of potential uses is not |
|
comprehensive. The purpose of this list is to provide contextual information |
|
about the possible use-cases that the model creators considered as part of model |
|
training and development. |
|
|
|
* Content Creation and Communication |
|
* Text Generation: These models can be used to generate creative text formats |
|
such as poems, scripts, code, marketing copy, and email drafts. |
|
* Research and Education |
|
* Natural Language Processing (NLP) Research: These models can serve as a |
|
foundation for researchers to experiment with NLP techniques, develop |
|
algorithms, and contribute to the advancement of the field. |
|
* Language Learning Tools: Support interactive language learning experiences, |
|
aiding in grammar correction or providing writing practice. |
|
* Knowledge Exploration: Assist researchers in exploring large bodies of text |
|
by generating summaries or answering questions about specific topics. |
|
|
|
### Limitations |
|
|
|
* Training Data |
|
* The quality and diversity of the training data significantly influence the |
|
model's capabilities. Biases or gaps in the training data can lead to |
|
limitations in the model's responses. |
|
* The scope of the training dataset determines the subject areas the model can |
|
handle effectively. |
|
* Context and Task Complexity |
|
* LLMs are better at tasks that can be framed with clear prompts and |
|
instructions. Open-ended or highly complex tasks might be challenging. |
|
* A model's performance can be influenced by the amount of context provided |
|
(longer context generally leads to better outputs, up to a certain point). |
|
* Language Ambiguity and Nuance |
|
* Natural language is inherently complex. LLMs might struggle to grasp subtle |
|
nuances, sarcasm, or figurative language. |
|
* Factual Accuracy |
|
* LLMs generate responses based on information they learned from their |
|
training datasets, but they are not knowledge bases. They may generate |
|
incorrect or outdated factual statements. |
|
* Common Sense |
|
* LLMs rely on statistical patterns in language. They might lack the ability |
|
to apply common sense reasoning in certain situations. |
|
|
|
### Ethical Considerations and Risks |
|
|
|
The development of large language models (LLMs) raises several ethical concerns. |
|
In creating an open model, we have carefully considered the following: |
|
|
|
* Bias and Fairness |
|
* LLMs trained on large-scale, real-world text data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny, with input data pre-processing and posterior evaluations reported in this card.
|
* Misinformation and Misuse |
|
* LLMs can be misused to generate text that is false, misleading, or harmful. |
|
* Guidelines are provided for responsible use with the model, see the |
|
[Responsible Generative AI Toolkit](http://ai.google.dev/gemma/responsible). |
|
* Transparency and Accountability
|
* This model card summarizes details on the models' architecture, |
|
capabilities, limitations, and evaluation processes. |
|
* A responsibly developed open model offers the opportunity to share |
|
innovation by making LLM technology accessible to developers and researchers |
|
across the AI ecosystem. |
|
|
|
Risks identified and mitigations: |
|
|
|
* Perpetuation of biases: Continuous monitoring (using evaluation metrics and human review) and the exploration of de-biasing techniques during model training, fine-tuning, and other use cases are encouraged.
|
* Generation of harmful content: Mechanisms and guidelines for content safety |
|
are essential. Developers are encouraged to exercise caution and implement |
|
appropriate content safety safeguards based on their specific product policies |
|
and application use cases. |
|
* Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate malicious applications of LLMs.
|
Educational resources and reporting mechanisms for users to flag misuse are |
|
provided. Prohibited uses of Gemma models are outlined in the |
|
[Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy). |
|
* Privacy violations: Models were trained on data filtered for removal of PII |
|
(Personally Identifiable Information). Developers are encouraged to adhere to |
|
privacy regulations with privacy-preserving techniques. |