---
library_name: transformers
license: apache-2.0
language:
- de
datasets:
- devngho/culturax-mini-nonshuffled
- maxidl/FineNews-unfiltered
- djstrong/oscar-small
- LemiSt/gutenberg_de
- almanach/HALvest
- wikimedia/wikipedia
- D4ve-R/terra-xplain-cc-de
base_model:
- HuggingFaceTB/SmolLM-135M
pipeline_tag: text-generation
---
# Model Card for SmolLM-135M-de
A German version of [HuggingFaceTB/SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M/blob/main/README.md), taught to speak German via continued pretraining (CPT) on about 6 billion tokens.
If you are looking for a chat model, try [this fine-tune](https://huggingface.co/LemiSt/SmolLM-135M-instruct-de-merged) or the [corresponding adapter model](https://huggingface.co/LemiSt/SmolLM-135M-instruct-de).
## Model Details
### Model Description
The base model is [HuggingFaceTB/SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M/blob/main/README.md), which I further trained on about 6 billion German-language tokens.
- **Model type:** Large Language Model (Llama architecture)
- **Language(s) (NLP):** German
- **License:** Apache 2.0
- **Finetuned from model:** [HuggingFaceTB/SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M/blob/main/README.md)
## Uses
I mainly made this as a small experimentation model for quickly benchmarking datasets and similar tasks; since the model is so small, I am unsure about its usefulness in any real-world scenario.
This is a base model without any chat fine-tuning and thus should not be used as-is. It outputs mostly correct German, which is what I set out to achieve.
If you are looking for a chat model, try [this](https://huggingface.co/LemiSt/SmolLM-135M-instruct-de) adapter.
## Bias, Risks, and Limitations
This is a very small model and will output blatantly wrong information. I have not done any further filtering on the source datasets, so it is possible that the model will generate lewd or otherwise inappropriate content. Use with care.
I would **strongly** recommend against using this model in a production setting, at least without further fine tuning and preference optimization.
## How to Get Started with the Model
Use the code below to get started with the model.
```python
# adapted from the original SmolLM repo
# pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "LemiSt/SmolLM-135M-de"
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
inputs = tokenizer.encode("Rezept für einen leckeren veganen Schokokuchen:\n", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```
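Since this is a base model, greedy decoding can get repetitive. Continuing from the snippet above, a sampling-based call often produces more varied German text; the sampling parameters below are illustrative and not tuned for this model.
```python
# continues from the snippet above (model, tokenizer and inputs already defined);
# sampling settings are illustrative, not tuned
outputs = model.generate(
    inputs,
    max_new_tokens=256,
    do_sample=True,   # sample instead of greedy decoding
    temperature=0.7,  # lower = more conservative text
    top_p=0.9,        # nucleus sampling
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```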
## Training Details
### Training Data
- [devngho/culturax-mini-nonshuffled](https://huggingface.co/datasets/devngho/culturax-mini-nonshuffled)
- [maxidl/FineNews-unfiltered](https://huggingface.co/datasets/maxidl/FineNews-unfiltered) CC-NEWS-2024-05 config, de split
- [djstrong/oscar-small](https://huggingface.co/datasets/djstrong/oscar-small) unshuffled_deduplicated_de config
- [LemiSt/gutenberg_de](https://huggingface.co/datasets/LemiSt/gutenberg_de)
- [almanach/HALvest](https://huggingface.co/datasets/almanach/HALvest) de config
- [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) 20231101.de config
- [D4ve-R/terra-xplain-cc-de](https://huggingface.co/datasets/D4ve-R/terra-xplain-cc-de)
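For quick inspection, the corpora listed above can be loaded with the `datasets` library. Below is a minimal sketch for two of them; the config names come from the list, while using the `train` split is an assumption.
```python
# Minimal sketch for inspecting two of the corpora listed above with the datasets library.
# Config names are taken from the list; loading the "train" split is an assumption.
from datasets import load_dataset

wiki_de = load_dataset("wikimedia/wikipedia", "20231101.de", split="train")
oscar_de = load_dataset("djstrong/oscar-small", "unshuffled_deduplicated_de", split="train")

print(wiki_de[0]["text"][:200])   # peek at the first Wikipedia article
print(oscar_de[0]["text"][:200])  # peek at the first OSCAR document
```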
### Training Procedure
This model was trained with axolotl, using full fine-tuning (no LoRA or other adapters). I used a sequence length of 2048 with an effective batch size of 512, a learning rate of 0.003 with the adamw_bnb_8bit optimizer and a cosine scheduler.
Due to an error I made in calculating the token count, I accidentally trained for nearly 2 epochs, with the learning rate not reaching its proper minimum.
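For reference, the stated hyperparameters roughly correspond to the following `transformers` `TrainingArguments`. This is a sketch, not the actual axolotl config: the per-device batch size / gradient accumulation split, the bf16 setting and the output directory name are assumptions.
```python
# Rough equivalent of the stated hyperparameters as transformers TrainingArguments.
# NOT the axolotl config that was actually used; the batch-size split below
# (8 per device x 64 accumulation steps = 512 effective) is a hypothetical example.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="smollm-135m-de-cpt",  # hypothetical output path
    per_device_train_batch_size=8,    # hypothetical split of the
    gradient_accumulation_steps=64,   # effective batch size of 512
    learning_rate=3e-3,               # 0.003 as stated above
    lr_scheduler_type="cosine",       # cosine schedule
    optim="adamw_bnb_8bit",           # 8-bit AdamW from bitsandbytes
    bf16=True,                        # assumption: bf16 mixed precision
    num_train_epochs=2,               # the run ended up at nearly 2 epochs
)
# The sequence length of 2048 is applied during tokenization/packing, not here.
```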