---
library_name: transformers
license: apache-2.0
language:
- de
datasets:
- devngho/culturax-mini-nonshuffled
- maxidl/FineNews-unfiltered
- djstrong/oscar-small
- LemiSt/gutenberg_de
- almanach/HALvest
- wikimedia/wikipedia
- D4ve-R/terra-xplain-cc-de
base_model:
- HuggingFaceTB/SmolLM-135M
pipeline_tag: text-generation
---
# Model Card for SmolLM-135M-de
A German version of [HuggingFaceTB/SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M/blob/main/README.md), taught to speak German via continued pretraining (CPT) on about 6 billion tokens.
If you are looking for a chat model, try [this fine-tune](https://huggingface.co/LemiSt/SmolLM-135M-instruct-de-merged) or the [corresponding adapter model](https://huggingface.co/LemiSt/SmolLM-135M-instruct-de).
## Model Details
### Model Description
The base model is [HuggingFaceTB/SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M/blob/main/README.md), which I further trained on about 6 billion German-language tokens.
- **Model type:** Large Language Model (Llama architecture)
- **Language(s) (NLP):** German
- **License:** Apache 2.0
- **Finetuned from model:** [HuggingFaceTB/SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M/blob/main/README.md)
## Uses
I mainly made this as a small experimentation model for quickly benchmarking datasets and similar tasks; since the model is so small, I am unsure about its usefulness in any real-world scenario.
This is a base model without any chat fine-tuning and thus should not be used as-is. It outputs mostly correct German, which is what I set out to achieve.
If you are looking for a chat model, try [this](https://huggingface.co/LemiSt/SmolLM-135M-instruct-de) adapter.
## Bias, Risks, and Limitations
This is a very small model and will output blatantly wrong information. I have not done any further filtering on the source datasets, so it is possible that the model will generate lewd or otherwise inappropriate content. Use with care.
I would **strongly** recommend against using this model in a production setting, at least without further fine tuning and preference optimization.
## How to Get Started with the Model
Use the code below to get started with the model.
```python
# adapted from the original SmolLM repo
# pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "LemiSt/SmolLM-135M-de"
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
inputs = tokenizer.encode("Rezept für einen leckeren veganen Schokokuchen:\n", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```
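Since this is a base model, greedy decoding can get repetitive. Continuing from the snippet above, a sampling-based call often produces more varied German text; the sampling parameters below are illustrative and not tuned for this model.
```python
# continues from the snippet above (model, tokenizer and inputs already defined);
# sampling settings are illustrative, not tuned
outputs = model.generate(
    inputs,
    max_new_tokens=256,
    do_sample=True,   # sample instead of greedy decoding
    temperature=0.7,  # lower = more conservative text
    top_p=0.9,        # nucleus sampling
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```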
## Training Details
### Training Data
- [devngho/culturax-mini-nonshuffled](https://huggingface.co/datasets/devngho/culturax-mini-nonshuffled)
- [maxidl/FineNews-unfiltered](https://huggingface.co/datasets/maxidl/FineNews-unfiltered) CC-NEWS-2024-05 config, de split
- [djstrong/oscar-small](https://huggingface.co/datasets/djstrong/oscar-small) unshuffled_deduplicated_de config
- [LemiSt/gutenberg_de](https://huggingface.co/datasets/LemiSt/gutenberg_de)
- [almanach/HALvest](https://huggingface.co/datasets/almanach/HALvest) de config
- [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) 20231101.de config
- [D4ve-R/terra-xplain-cc-de](https://huggingface.co/datasets/D4ve-R/terra-xplain-cc-de)
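For quick inspection, the corpora listed above can be loaded with the `datasets` library. Below is a minimal sketch for two of them; the config names come from the list, while using the `train` split is an assumption.
```python
# Minimal sketch for inspecting two of the corpora listed above with the datasets library.
# Config names are taken from the list; loading the "train" split is an assumption.
from datasets import load_dataset

wiki_de = load_dataset("wikimedia/wikipedia", "20231101.de", split="train")
oscar_de = load_dataset("djstrong/oscar-small", "unshuffled_deduplicated_de", split="train")

print(wiki_de[0]["text"][:200])   # peek at the first Wikipedia article
print(oscar_de[0]["text"][:200])  # peek at the first OSCAR document
```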
### Training Procedure
This model was trained with axolotl, using full fine-tuning (no LoRA or other adapters). I used a sequence length of 2048 with an effective batch size of 512, a learning rate of 0.003 with the adamw_bnb_8bit optimizer and a cosine scheduler.
Due to an error I made in calculating the token count, I accidentally trained for nearly 2 epochs, with the learning rate not reaching its proper minimum.
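For reference, the stated hyperparameters roughly correspond to the following `transformers` `TrainingArguments`. This is a sketch, not the actual axolotl config: the per-device batch size / gradient accumulation split, the bf16 setting and the output directory name are assumptions.
```python
# Rough equivalent of the stated hyperparameters as transformers TrainingArguments.
# NOT the axolotl config that was actually used; the batch-size split below
# (8 per device x 64 accumulation steps = 512 effective) is a hypothetical example.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="smollm-135m-de-cpt",  # hypothetical output path
    per_device_train_batch_size=8,    # hypothetical split of the
    gradient_accumulation_steps=64,   # effective batch size of 512
    learning_rate=3e-3,               # 0.003 as stated above
    lr_scheduler_type="cosine",       # cosine schedule
    optim="adamw_bnb_8bit",           # 8-bit AdamW from bitsandbytes
    bf16=True,                        # assumption: bf16 mixed precision
    num_train_epochs=2,               # the run ended up at nearly 2 epochs
)
# The sequence length of 2048 is applied during tokenization/packing, not here.
```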