german_llm_outputs / README.md
Florian Leuerer
README
510e903
|
raw
history blame
No virus
1.24 kB
---
title: German Llm Outputs
emoji: 🦀
colorFrom: green
colorTo: pink
sdk: gradio
sdk_version: 4.12.0
app_file: app.py
pinned: false
license: mit
---
# Dataset
The dataset usesd is https://huggingface.co/datasets/lmsys/chatbot_arena_conversations
Preprocessing:
- filtered german conversations
- took first user prompt
- deleted short prompts (less than 70 chars)
```python
dataset = load_dataset('lmsys/chatbot_arena_conversations')
def get_message(x):
x['message'] = [x['conversation_a'][0]]
return x
dataset = dataset.filter(lambda x: x['language'] == 'German')
dataset = dataset['train'].map(get_message)
dataset = dataset.filter(lambda x: len(x['message'][0]['content']) > 70)
```
# Generation
I rely on the huggingface `conversational` pipeline to generate the outputs. There are some issues with the chat template (esp. for the non-instruction tuned models) i'll fix later.
```python
messages = json.loads(Path('messages.json').read_text())
outputs = []
pipe = pipeline(
"conversational",
model=model_name,
torch_dtype="auto",
device_map=device,
max_new_tokens=1024,
trust_remote_code=True
)
for message in tqdm(messages):
output = pipe([message])
outputs.append(output)
```